Removing Chimeras

lande...@greenmtn.edu

Oct 21, 2017, 9:36:00 AM

to Qiime 1 Forum

I'm trying to run ChimeraSlayer but I'm getting an error message that I think means the program is not installed. I'm using Qiime 1.9 on an Amazon EC2 instance. My preference is to use UChime, but my dataset (22 million sequences) is too large for the 32-bit version. Below are the command I ran, the error message, and the output of print_qiime_config.py. Any suggestions on how to get ChimeraSlayer to work, or on other programs for finding chimeras? Thanks for your help!

Bill

**Command that I ran**

identify_chimeric_seqs.py -i /PATH/seqs.fna -a /PATH/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta -o /OUTPUT/ -m ChimeraSlayer

**Error message**

Traceback (most recent call last):
  File "/usr/local/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
  File "/usr/local/bin/identify_chimeric_seqs.py", line 328, in main
    keep_intermediates=keep_intermediates)
  File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 159, in chimeraSlayer_identify_chimeras
    keep_intermediates=keep_intermediates):
  File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 143, in __call__
    keep_intermediates=keep_intermediates)
  File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 637, in get_chimeras_from_Nast_aligned
    app_results = app()
  File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 295, in __call__
    result_paths = self._get_result_paths(data)
  File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 419, in _get_result_paths
    raise ApplicationError("Calling ChimeraSlayer failed.")
burrito.util.ApplicationError: Calling ChimeraSlayer failed.

**Results of print_qiime_config.py**

System information
==================
         Platform:    linux2
   Python version:    2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3]
Python executable:    /usr/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
https://github.com/biocore/qiime-default-reference/releases/tag/0.1.2

Dependency versions
===================
          QIIME library version:    1.9.1
           QIIME script version:    1.9.1
qiime-default-reference version:    0.1.2
                  NumPy version:    1.9.2
                  SciPy version:    0.15.1
                 pandas version:    0.16.1
             matplotlib version:    1.4.3
            biom-format version:    2.1.4
                   h5py version:    2.5.0 (HDF5 version: 1.8.4)
                   qcli version:    0.1.1
                   pyqi version:    0.3.2
             scikit-bio version:    0.2.3
                 PyNAST version:    1.2.2
                Emperor version:    0.9.51
                burrito version:    0.9.1
       burrito-fillings version:    0.1.1
              sortmerna version:    SortMeRNA version 2.0, 29/11/2014
              sumaclust version:    SUMACLUST Version 1.0.00
                  swarm version:    Swarm 1.2.19 [May 26 2015 15:28:37]
                          gdata:    Installed.

QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
http://qiime.org/install/qiime_config.html
http://qiime.org/tutorials/parallel_qiime.html

                     blastmat_dir:    /qiime_software/blast-2.2.22-release/data
      pick_otus_reference_seqs_fp:    /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                         sc_queue:    all.q
      topiaryexplorer_project_dir:    None
     pynast_template_alignment_fp:    /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta
                  cluster_jobs_fp:    start_parallel_jobs.py
pynast_template_alignment_blastdb:    None
assign_taxonomy_reference_seqs_fp:    /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                     torque_queue:    friendlyq
                    jobs_to_start:    1
                       slurm_time:    None
            denoiser_min_per_core:    50
assign_taxonomy_id_to_taxonomy_fp:    /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
                         temp_dir:    /home/ubuntu/temp/
                     slurm_memory:    None
                      slurm_queue:    None
                      blastall_fp:    /qiime_software/blast-2.2.22-release/bin/blastall
                 seconds_to_sleep:    1

Colin Brislawn

Oct 21, 2017, 1:04:02 PM

to Qiime 1 Forum

Hello Bill,

> Any suggestions on how to get Chimera Slayer to work or on other programs for finding Chimeras?

Yes! I really like the program vsearch. It implements the uchime algorithm, just like usearch does, but vsearch is 64-bit and fully open source.

You can learn more about vsearch here: https://github.com/torognes/vsearch

You can install it with: conda install vsearch
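Before or after installing, it can help to confirm which tools are actually visible on the PATH, since a missing executable is one plausible cause of the "Calling ChimeraSlayer failed" error. This is a minimal Python 3 sketch; the executable names `vsearch` and `ChimeraSlayer.pl` are assumptions about how each tool is installed, so adjust them for your system.

```python
import shutil


def find_tools(names):
    """Return {name: full path, or None if not on PATH} for each executable."""
    return {name: shutil.which(name) for name in names}


# Executable names below are assumptions; change them to match your install.
for name, path in find_tools(["vsearch", "ChimeraSlayer.pl"]).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A `None` result only shows that the name is not on the PATH of the current shell; it does not rule out other causes of the failure.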

Colin

lande...@greenmtn.edu

Oct 22, 2017, 2:54:15 PM

to Qiime 1 Forum

Thanks Colin! From the user manual I think I have the general idea of how it works, but it seems that running it on my full dataset (22 million sequences) will take days or weeks. I tried dereplicating, but it's not clear how I would use any chimeras found to refer back to the original dataset, because dereplication writes FASTA output and loses the sequence-specific information. Any suggestions on what a command would look like in general, or where I could find some additional tips? Sorry if this is a little off topic.

Bill

Colin Brislawn

Oct 23, 2017, 11:32:06 AM

to Qiime 1 Forum

Hello Bill,

You are absolutely correct. Uchime is designed to be run on your dereplicated reads, which include the needed abundance annotation but lack sample-specific labels. While vsearch / uchime will be elegantly integrated in qiime 2, qiime 1 doesn't include a great way to do this.
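To make the bookkeeping concrete, here is a minimal Python sketch of dereplication that keeps a map from each unique sequence back to the original read IDs, so that chimeric uniques can be traced to the reads they came from. It is an illustration under stated assumptions, not what vsearch does internally: the `uniqN` labels and both function names are hypothetical, and `;size=N;` stands for the usearch-style abundance annotation.

```python
def dereplicate(records):
    """Collapse identical sequences.

    records: iterable of (read_id, sequence) pairs.
    Returns (uniques, members):
      uniques: list of (label, sequence), most abundant first, where the
               label carries a ';size=N;' abundance annotation
      members: dict mapping each unique ID to its original read IDs
    """
    groups = {}
    for read_id, seq in records:
        groups.setdefault(seq.upper(), []).append(read_id)

    # Most abundant first; sorted() is stable, so ties keep first-seen order.
    ordered = sorted(groups.items(), key=lambda item: -len(item[1]))

    uniques, members = [], {}
    for i, (seq, read_ids) in enumerate(ordered, start=1):
        uid = f"uniq{i}"  # hypothetical label scheme
        uniques.append((f"{uid};size={len(read_ids)};", seq))
        members[uid] = read_ids
    return uniques, members


def reads_to_drop(chimera_labels, members):
    """Expand chimeric unique labels back to the original read IDs."""
    drop = set()
    for label in chimera_labels:
        uid = label.split(";", 1)[0]  # strip the ';size=N;' annotation
        drop.update(members[uid])
    return drop


reads = [("S1_0", "ACGT"), ("S2_0", "ACGT"), ("S1_1", "TTTT"), ("S2_1", "acgt")]
uniques, members = dereplicate(reads)
print(uniques)
print(reads_to_drop(["uniq2;size=1;"], members))
```

For 22 million reads the same idea applies, but the map would normally be written to disk rather than held in a dictionary.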

One option is to use vsearch directly for chimera detection and OTU picking, then use qiime for downstream analysis. This method does not come with an elegant workflow script and takes a few more steps, but it's pretty fast. Here is one example of how I used this on a past project. See scripts 4, 5, & 6.

https://github.com/pnnl/bernstein-2016-productivity-and-diversity/tree/master/analysis/scripts

Have you considered using qiime 2? While it's still in development, it will offer greater flexibility and better interfaces with programs like vsearch. It might be worth checking out.

Colin

lande...@greenmtn.edu

Oct 24, 2017, 9:12:35 PM

to Qiime 1 Forum

Hi Colin,

Thanks again! Yes, I plan to switch to Qiime 2 and am looking forward to all the new features. This issue with Chimeras might be the final motivation to make the switch!

I mostly got your recommended procedure to work but got stuck on the last step: uc_to_otu.py. That doesn't seem to be a Qiime or vsearch command. Any more tips you can send my way?

I was surprised by how much faster the OTU picking went and it seems that there are dramatically fewer representative sequences: 1,100 from vsearch vs. 31,000 from usearch (via Qiime). Does that sound right and what could account for this difference? The chimeras were about 2.3% of the dataset so I assume that does not account for this large drop. Thanks again Colin, your help is greatly appreciated.

Bill

Colin Brislawn

Oct 25, 2017, 9:23:58 AM

to Qiime 1 Forum

Good morning Bill,

I'm glad you got this working! Yes, fully de novo OTU picking will make fewer OTUs than the default qiime method, which is open-reference. The accuracy of the vsearch alignment could also account for this: usearch performs heuristic alignments while vsearch performs exact / optimal alignments, leading to greater accuracy.

I should have mentioned uc_to_otu.py sooner. This final command converts the -usearch_global hits to an OTU table (otus.txt). You can currently do that using the biom-format package and this command:

biom from-uc -i usearch_global.uc -o otus.txt

Or you can use this Python script:

https://github.com/leffj/helper-code-for-uparse/blob/master/create_otu_table_from_uc_file.py
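For readers who want to see what this conversion involves, here is a minimal Python sketch of turning .uc hit records into an OTU table. It is not the code of either tool above. It assumes the usual 10-column tab-separated .uc layout (record type in column 1, query label in column 9, target label in column 10) and QIIME-style read labels of the form `SampleID_N`; the function names are hypothetical.

```python
from collections import defaultdict


def otu_table_from_uc(lines):
    """Build {otu_id: {sample_id: count}} from .uc hit ('H') records."""
    table = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] != "H":  # keep hits only; 'N' records are reads with no hit
            continue
        query, target = fields[8], fields[9]
        # Assumption: read labels look like 'SampleID_N'.
        sample_id = query.split()[0].rsplit("_", 1)[0]
        otu_id = target.split(";", 1)[0]  # drop any ';size=N;' annotation
        table[otu_id][sample_id] += 1
    return table


def format_table(table):
    """Render the table as tab-separated text, one row per OTU."""
    samples = sorted({s for counts in table.values() for s in counts})
    rows = ["#OTU ID\t" + "\t".join(samples)]
    for otu_id in sorted(table):
        counts = table[otu_id]
        rows.append(otu_id + "\t" + "\t".join(str(counts.get(s, 0)) for s in samples))
    return "\n".join(rows)


# Made-up example records, for illustration only.
example_uc = [
    "H\t0\t253\t99.2\t+\t0\t0\t253M\tS1_0\tOTU_1;size=10;",
    "H\t0\t253\t98.0\t+\t0\t0\t253M\tS2_5\tOTU_1;size=10;",
    "N\t*\t253\t*\t*\t*\t*\t*\tS1_7\t*",
    "H\t1\t253\t97.5\t+\t0\t0\t253M\tS1_9\tOTU_2;size=4;",
]
print(format_table(otu_table_from_uc(example_uc)))
```

If your sample IDs or read labels follow a different pattern, the line that derives `sample_id` is the part to change.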

Let me know how well that script works for you,

Colin

lande...@greenmtn.edu

Oct 25, 2017, 10:05:00 AM

to Qiime 1 Forum

This has been a really useful exercise and a real eye-opener, as it is forcing me to re-think my procedures. Ultimately I think the cleanest option is to re-run everything through Qiime 2, as discussed. Let me ask you this, though I realize it might be a tricky question to answer: vsearch found 2.3% chimeric sequences in the dereplicated dataset of about 800,000 sequences (the original file had 22 million). Setting vsearch aside for the moment, I used Qiime to finish the analysis without dereplication and without removing chimeras. However, I removed all singletons, doubletons, and all sequences with a classification of "unassigned" from the OTU table. Do you think this procedure is likely to remove most of the chimeras? Don't worry, I'm not going to publish it this way, but it would be helpful for now to have a sense of how much I should expect my results to change.

I'll let you know how the final step of the vsearch process goes. Thanks again!

Colin Brislawn

Oct 25, 2017, 1:07:20 PM

to Qiime 1 Forum

Good morning Bill,

Identifying chimeras accurately requires a truth data set, with known microbes. You can then identify which amplicons are coming from the microbes, and which amplicons are Illumina adapters, PCR errors, or chimeras (which I guess are PCR errors too!).

Without these samples in your cohort, it's hard to evaluate how close your samples are to perfect.

One known issue with Qiime 1 is OTU inflation. Quality filtering the initial reads, along with chimera detection and removal of singletons, helps with this, but it's a standing issue. While getting 2000+ extra OTUs sounds really bad, in practice it's not a major issue. The extra OTUs are often just errors of the more abundant real OTUs (over-splitting during clustering), so weighted metrics will prioritize the true OTUs over the error-filled ones. The UniFrac distance metric is also robust to over-splitting, so community comparison ends up being very similar when run with the older methods.
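A small Python sketch can illustrate why low-abundance OTUs matter less for weighted metrics: it drops OTUs below a total-count threshold and reports what fraction of OTUs and of reads were removed. The function name, the threshold of 3 (which removes singletons and doubletons), and the toy counts are all made up for illustration. This says nothing about which OTUs are actually chimeric.

```python
def filter_rare_otus(table, min_total=3):
    """Keep OTUs whose total count across samples is >= min_total.

    table: {otu_id: {sample_id: count}}
    Returns (kept_table, fraction_of_otus_removed, fraction_of_reads_removed).
    """
    kept = {
        otu: counts
        for otu, counts in table.items()
        if sum(counts.values()) >= min_total
    }
    total_reads = sum(sum(c.values()) for c in table.values())
    kept_reads = sum(sum(c.values()) for c in kept.values())
    frac_otus = 1 - len(kept) / len(table) if table else 0.0
    frac_reads = 1 - kept_reads / total_reads if total_reads else 0.0
    return kept, frac_otus, frac_reads


# Toy counts, invented for illustration only.
toy = {
    "OTU_1": {"S1": 500, "S2": 480},
    "OTU_2": {"S1": 2},   # doubleton
    "OTU_3": {"S2": 1},   # singleton
    "OTU_4": {"S1": 10, "S2": 7},
}
kept, frac_otus, frac_reads = filter_rare_otus(toy)
print(sorted(kept), f"{frac_otus:.1%} of OTUs removed, {frac_reads:.1%} of reads removed")
```

In this toy table half of the OTUs disappear while well under 1% of the reads do, which is the pattern described above: many extra OTUs, few reads behind them.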

I'm glad you found this issue. Having these moments which force you to question the underlying methods really helps deepen the understanding of the field.

Let me know what you find next!

Colin

lande...@greenmtn.edu

Oct 25, 2017, 2:53:46 PM

to Qiime 1 Forum

I need to bug you about one additional issue: I'm getting a message that "from-uc" is not a recognized command ("Unrecognized command from-uc") but my biom software seems to be up to date.

Colin Brislawn

Oct 25, 2017, 3:32:02 PM

to Qiime 1 Forum

Strange. It works for me with this version:

biom --version

biom, version 2.1.5

What version of biom are you using? You may need the newer one (the version that is compatible with qiime2).
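For reference, the print_qiime_config.py output at the top of this thread lists biom-format 2.1.4, one patch release older than the 2.1.5 shown here. Whether `from-uc` is missing from 2.1.4 is not confirmed in this thread, so treat the following as a sketch of the version comparison only; the helper names are hypothetical.

```python
def version_tuple(text):
    """'2.1.5' -> (2, 1, 5). Stops at the first non-numeric part ('2.1.5-dev')."""
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, required):
    """True if the installed version is the required version or newer."""
    return version_tuple(installed) >= version_tuple(required)


# 2.1.4 is the version reported earlier in the thread; 2.1.5 is the one above.
print(is_at_least("2.1.4", "2.1.5"))
```

Comparing tuples of integers avoids the string-comparison trap where "2.1.10" sorts before "2.1.5".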

Colin
