Chimera check and Singleton removal


Abhi genobio

Apr 8, 2016, 7:16:33 AM
to Qiime 1 Forum
Dear Users,


1. I want to perform a chimera check on my query sequences before OTU picking. I have 50 samples and the file size is 7 GB. When I ran usearch61 it threw a fatal error because the dataset is too large for the 32-bit version. How should I proceed in this case, and what is the correct approach to chimera checking?

2. Secondly, if I do the chimera check with the usearch method, is it correct to then perform closed-reference OTU picking with the default uclust_ref method, or do I have to pick OTUs with the usearch method as well?

3. I performed closed-reference OTU picking with the uclust_ref method and generated a BIOM file, then filtered singletons with n=2. After that I split the BIOM file and want to remove, from each sample, the OTUs whose abundance is < 10. How should I do this, and is removing OTUs based on abundance a correct approach?


4. What is the use of --min_count_fraction when running filter_otus_from_otu_table.py? Which is the more appropriate method: removing singletons, or filtering with --min_count_fraction? Can anyone explain the advantages of --min_count_fraction and how it is calculated?


5. I ran closed-reference OTU picking on 50 samples with all default parameters. Each of my samples has 50k to 0.2M reads, yet the BIOM summary shows counts/sample in the range of 4k to 25k for 50% of the samples; sample1 has 0.2M reads but a count of only 4,000. How should I troubleshoot a scenario like this, where I am losing almost 98% of my reads?

6. After summarizing taxa I found many OTUs such as k__Bacteria; p__Acidobacteria; c__[Chloracidobacteria]; o__11-24; f__; g__; s__, where the genus and species are not known. When I compute alpha diversity on my BIOM file with the observed_species index, such OTUs are counted even though no species name is present. How do we justify that? Should these OTU IDs be kept, or should such OTUs be discarded before computing alpha diversity? If I want to perform alpha diversity at the genus level only, how should I proceed? And should an unknown genus or species be termed an unclassified genus/species or an unassigned genus/species? Which term is technically correct?

Inputs will be highly appreciated.

Thanks

jonsan

Apr 12, 2016, 1:01:31 PM
to Qiime 1 Forum
Hi Abhi,

Some responses from the Qiime developers follow:

1. You will need the 64 bit version of usearch61 for large datasets. An alternative option is vsearch, which will run the same way in QIIME as usearch61, but is free and 64 bit. https://github.com/torognes/vsearch

2. With closed-reference OTU picking, I would avoid chimera checking, as you are already discarding reads that fail to match the Greengenes data. If you still wanted to do chimera checking, I would do chimera checking after OTU picking with ChimeraSlayer, or before OTU picking with usearch61, as described here: http://qiime.org/tutorials/chimera_checking.html
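For reference, the pre-OTU-picking route in that tutorial looks roughly like the sketch below. This is only a sketch: the file names are placeholders and the name of the chimera ID file can differ depending on the method, so check the tutorial for the exact outputs.

identify_chimeric_seqs.py -m usearch61 -i seqs.fna -r 97_otus.fasta -o usearch61_chimera_check/
filter_fasta.py -f seqs.fna -o seqs_nochimeras.fna -s usearch61_chimera_check/chimeras.txt -n

The -n flag tells filter_fasta.py to exclude the listed (chimeric) IDs rather than keep them.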

3. filter_otus_from_otu_table.py would be the script to filter out low-abundance OTUs once you already have an OTU table. Singleton removal has been a standard for a while, but the cutoff is arbitrary. Depending upon the sequencing technology, removing OTUs below a larger count threshold as likely noise can certainly be justified: even if such OTUs are real and not sequencing noise/chimeras, they are unlikely to be significant players in the microbial community.

4. --min_count_fraction would be used if you wanted the script to calculate where to filter based upon the overall relative abundance of a given OTU, rather than an arbitrary absolute count like 2 or 10. This makes it easy to filter down to just the major taxa.
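For example, removing singletons from an existing table, and separately filtering by overall relative abundance (file names and the 0.005% threshold are placeholders for illustration only):

filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_mc2.biom -n 2
filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_abund.biom --min_count_fraction 0.00005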

5. You might try open reference OTU picking to retain more reads, if the taxa present are novel or not represented in the Greengenes data. 

6. The OTUs can still be present, even without a given species (or genus/family/order) name, and a tree can be built based upon the 16S read. I wouldn't remove the OTUs, as they are probably real taxa, and should be included in the phylogenetic metrics like PD. You can do diversity metrics at the genus/family or other level from a collapsed table (from summarize_taxa.py), but you can't use a metric that depends upon a tree (like PD or UniFrac). Referring to the unclassified or uncultured taxa as such is acceptable. Unless there is a significant difference in these uncultured taxa between your samples, you probably don't have to worry a great deal about describing them (journals will probably start stronger pushes to culture and characterize these unknown taxa, esp. if they are defining the differences between communities though).
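As a sketch of the collapsed-table route, assuming the .biom output that summarize_taxa.py writes alongside its classic text tables (file names are placeholders), genus-level (L6) alpha diversity with non-phylogenetic metrics would look something like:

summarize_taxa.py -i otu_table.biom -L 6 -o taxa_L6/
alpha_diversity.py -i taxa_L6/otu_table_L6.biom -m observed_species,shannon -o alpha_genus_level.txt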


Hope that helps!
-jon

Abhi genobio

Apr 12, 2016, 2:23:46 PM
to Qiime 1 Forum
Hi Jonsan,

Thanks so much for answering all my queries so patiently.

As you mentioned above in your 5th point, I have started the analysis using vsearch.

I guess vsearch is still not integrated with QIIME, so it cannot be selected directly with the -m option for open-reference OTU picking. I read in another thread on this forum that one can try vsearch by renaming its binary to usearch61, but I am a bit confused about that. If I do so, will it perform all the steps that usearch does in the open-reference method?
 
Another concern is time: which tool will generate results fastest for 7 GB of data?

I followed the vsearch pipeline separately, generated a .uc file, and then converted it to a BIOM file.

To assign taxonomy, I used
parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 4000 -O 24 -t /gg_13_5/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r /gg_13_5/gg_13_8_otus/rep_set/97_otus.fasta -i nochimeras.seq.fasta -o assigned_taxonomy 

Do I have to provide any other parameters? This step is taking a lot of time; is there a faster alternative for data of this size?

2. What is the correct point for the chimera check, before OTU picking or after? As per the QIIME tutorial, singleton removal and chimera checking are done after OTU picking, whereas in usearch and vsearch they come before. How do I decide which is correct?

3. Can we remove singletons after closed-reference OTU picking, or is that only appropriate for open-reference and de novo methods?


Many thanks
Abhi












jonsan

Apr 12, 2016, 2:38:01 PM
to Qiime 1 Forum
Hi Abhi, 

Do read through Colin's responses in this thread. There is a lot of good information there, though it follows the pipeline steps for uparse-like de novo OTU clustering rather than open-ref OTU picking. I have an ipython notebook in which I did these steps, though I am currently having cluster problems and can't access it at the moment.

I've also thought about renaming vsearch to quickly integrate it into the standard Qiime workflows, but haven't really given it a try yet. The option flags for vsearch are intentionally made to be very similar to usearch, so it might work. You could try renaming usearch61 to something else temporarily and just see if it works. Has anyone else had experience doing this?
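A low-effort way to test that idea, sketched here with an assumed ~/bin directory early on your PATH, is to point a usearch61-named symlink at the vsearch binary and see whether the workflow scripts pick it up:

mkdir -p ~/bin
ln -s "$(which vsearch)" ~/bin/usearch61
export PATH="$HOME/bin:$PATH"
usearch61 --version   # should now report vsearch's version string

If the workflow scripts then fail on usearch61-specific options, that would tell you the drop-in substitution isn't clean.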

Regarding taxonomy assignment, make sure you're doing it on only the representative sequences rather than all the input sequences.

Finally, for chimera removal, the usearch/vsearch algorithm does de-novo chimera filtering as an integral part of the de-novo clustering pipeline. It's a bit more complicated as part of a reference-based pipeline. What is your goal with the analysis? That might guide your OTU picking strategy. 

-j

Abhi genobio

Apr 13, 2016, 3:29:13 PM
to Qiime 1 Forum
Hi Jonsan,

The main objective of my study is to determine species diversity across various groups; each group has a unique set of samples. Since my data is huge, I want a method that runs faster. I thought of doing singleton removal and chimera checking for each sample using vsearch, then concatenating all the samples and doing open-reference OTU picking in QIIME with uclust. I want to know whether this is a correct approach and whether it will take less computational time.

I read Colin's response regarding vsearch, and I have a few doubts about it.

1. Are dereplication, singleton removal, and chimera checking supposed to happen samplewise, or can we do them on a combined_seqs.fna (i.e. the QIIME output after adding QIIME labels)?

2. How much time does parallel_assign_taxonomy_rdp.py take on 8 GB of data? I have used the command below:
  parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 4000 -O 24 -t /gg_13_5/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r /gg_13_5/gg_13_8_otus/rep_set/97_otus.fasta -i sl_out_miseq_run_02/seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta -o sl_out_miseq_run_02/assigned_taxonomy

3. Can I do open-reference OTU picking with default parameters and then do chimera checking and singleton removal? Is this approach correct or not?

I suspect it might not be: usearch61 with 32-bit compatibility will not work on my dataset, so the chimera check will not happen with the default parameters. Any suggestions on parameter changes?

I want a pipeline where most of the sequences get assigned to OTUs and the dataset ends up singleton-free and chimera-checked. I am getting confused.

Any comments would be appreciated.

Many thanks
Abhi



 

Colin Brislawn

Apr 13, 2016, 8:21:20 PM
to Qiime 1 Forum
These are great questions Abhi. I can answer some of them. 

1. Are dereplication, singleton removal, and chimera checking supposed to happen samplewise, or can we do them on a combined_seqs.fna (i.e. the QIIME output after adding QIIME labels)?
I perform it on the combined_seq.fna of all my reads. You can, however, perform it samplewise. 

3. Can I do open-reference OTU picking with default parameters and then do chimera checking and singleton removal? Is this approach correct or not?
I think that's reasonable. Different people recommend chimera checking before or after OTU picking. Whatever order you choose, make sure you make this step clear in your methods section. 

I want a pipeline where most of the sequences get assigned to OTUs and the dataset ends up singleton-free and chimera-checked. I am getting confused.
Take a look at the way Robert Edgar describes the uparse pipeline. I find his method to be very clear and a good point of reference when thinking about amplicon pipelines.
Discussion of the uparse pipeline: http://www.drive5.com/usearch/manual/uparse_pipeline.html 

I hope this helps,
Colin


jonsan

Apr 13, 2016, 8:32:21 PM
to Qiime 1 Forum
I echo Colin's thoughts. A few additional comments:

1. Removing singletons from the combined sequence set will be more conservative, as you only remove sequences found once in the whole dataset (rare sequences might be singletons in one sample but well represented in others).

2. The assign taxonomy step depends on how many representative sequences you end up with. As one datapoint, I just looked up an old analysis that did open reference with 32 processors on a dataset that started with a demultiplexed seqs.fna file of about 7.5 Gb, ended up with 190874 representative sequences, and took about 17 minutes to complete taxonomy assignment. 

3. You should be able to use vsearch in 64-bit mode to do de novo chimera checking after OTU picking (or in parallel with OTU picking) without any issues. That would be my recommendation.

-j

Abhi genobio

Apr 14, 2016, 12:23:03 PM
to Qiime 1 Forum
Thanks Colin and Jonsan for all the valuable information.

I have one more doubt related to the open-reference OTU picking method.

I did closed-reference OTU picking on my samples and then open-reference OTU picking to compare the results, since a lot of my sequences were not assigned to any OTU by the closed-reference method. I know the results of closed and open reference will never be exactly the same, but something confuses me.
The abundance counts of the top species identified by closed-reference OTU picking for 4 samples are shown below:
k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Piscirickettsiaceae;g__Thiomicrospira;s__frisia    9792.0    20986.0    2.0    19260.0

whereas with open-reference OTU picking, the abundance of the same OTU (the one found most abundant in the closed-reference run) is different, i.e.

k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Piscirickettsiaceae;g__Thiomicrospira;s__frisia    420.0    0.0    573.0    172.0

My doubt is this: open-reference OTU picking also does a reference-based search in its very first step (step 1), so the abundance count should be at least the closed-reference value; it can be more, but not less than what was picked up in step 1. How can the count be lower in open reference, as shown above, than in closed reference?
Inputs will be highly appreciated.

Many Thanks
Abhi


jonsan

Apr 14, 2016, 3:07:22 PM
to Qiime 1 Forum
Hi Abhi, 

That does look like unusual behavior. How exactly are you getting those values? Is it from a taxa summarization, or directly from the OTU table? 


-jon

Abhi genobio

Apr 15, 2016, 2:10:04 AM
to Qiime 1 Forum
Hi Jonsan,

The information was taken from the taxa summarization.

The command I used for open-reference OTU picking is below.

pick_open_reference_otus.py -i combined_seqs.fna -r /gg_13_8_otus/rep_set/97_otus.fasta -o otu_open/ -p params.txt

For taxa summarization I use:

summarize_taxa_through_plots.py -i otu_open/otu_table_mc2_w_tax_no_pynast_failures.biom -o Taxasummary -m Mapping.txt -p params.txt

The parameter file is as follows:

pick_otus:enable_rev_strand_match True
pick_otus:similarity 0.97
summarize_taxa:level 2,3,4,5,6,7
summarize_taxa:absolute_abundance True
assign_taxonomy:reference_seqs_fp /gg_13_8_otus/rep_set/97_otus.fasta
assign_taxonomy:id_to_taxonomy_fp /gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
align_seqs:template_fp /gg_13_8_otus/rep_set_aligned/85_otus.fasta
filter_alignment:suppress_lane_mask_filter True
filter_alignment:allowed_gap_frac 0.80
filter_alignment:entropy_threshold 0.10

Difference in the results:

Closed-reference OTU picking, top OTU:

k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Piscirickettsiaceae;g__Thiomicrospira;s__frisia    9792.0    20986.0    2.0    19260.0

Open-reference OTU picking, top OTU (the species level is blank):

otu_table_mc2_w_tax_no_pynast_failures_L7:
k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Piscirickettsiaceae g__Thiomicrospira s__    26468    5    47425    15444

otu_table_mc2_w_tax_L7:
k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Piscirickettsiaceae g__Thiomicrospira s__    26468    5    47425    15444

The OTU with g__Thiomicrospira;s__frisia, in both otu_table_mc2_w_tax_no_pynast_failures_L7 and otu_table_mc2_w_tax_L7, is:

k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Piscirickettsiaceae;g__Thiomicrospira;s__frisia    420.0    0.0    573.0    172.0

whereas closed-reference OTU picking gives:

k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Piscirickettsiaceae;g__Thiomicrospira;s__frisia    9792.0    20986.0    2.0    19260.0

1. Although the information up to the genus level is the same in all cases, the species of the top hit differs between the closed- and open-reference methods.

2. The final_otu_map file in the open-reference workflow concatenates the information from steps 1, 3, and 4, and a BIOM file is generated from it, so I expected the abundant species from the closed-reference run to be reflected in the final BIOM file. Step 1 has the same abundance counts as the closed-reference run. I wonder how to troubleshoot this issue.

Kindly help me out of it.


Thanks
Abhi


Abhi genobio

Apr 15, 2016, 3:15:10 AM
to Qiime 1 Forum
Hi Jonsan,

I have one more point of confusion related to chimera checking.

1. As you stated above, a samplewise chimera check is more appropriate than working on the combined_seqs.fna file.
So, if I perform a de novo chimera check, which file should I use as input? Should it be the per-sample query FASTA, or do I need to dereplicate and cluster each sample and use the output of the clustering step as the input for the de novo chimera check? Or can we proceed samplewise without performing dereplication and clustering at all? This adds to my confusion because I ran the de novo chimera check samplewise directly, without dereplication or clustering, and all my samples came back non-chimeric. So I am more confused.

vsearch --uchime_denovo query.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta

Kindly help me with the previous query as well; I am testing all the different methods.
Your input will be highly appreciated.

Thanks
Abhi





Colin Brislawn

Apr 15, 2016, 1:41:13 PM
to qiime...@googlegroups.com
Hello Abhi,

Yeah, this whole process is kind of confusing. Let's start from the beginning and describe the two 'normal' ways of chimera checking using your single combined_seqs.fna file. 

Chimera check after OTU picking:
vsearch --derep_fulllength --sizeout (optional singleton removal) 
vsearch --cluster_size --sizein --sizeout
vsearch --uchime_denovo --sizein --xsize

Chimera check before OTU picking:
vsearch --derep_fulllength --sizeout
vsearch --uchime_denovo --sizein --sizeout
vsearch --cluster_size --sizein --xsize


Here are some patterns I'm sure you see between these two methods.
The last step always uses --xsize because size annotations mess up other scripts.
The middle step always uses --sizein and --sizeout so that size annotations are passed on.
The first step is always --derep_fulllength, because --uchime_denovo always needs size annotations in order to find chimeras.
I ran the de novo chimera check samplewise directly, without dereplication or clustering, and all my samples came back non-chimeric.
Of course. --uchime_denovo does not work without dereplication. 
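As a concrete sketch of that minimal sequence (file names are placeholders):

vsearch --derep_fulllength combined_seqs.fna --output derep.fna --sizeout
vsearch --uchime_denovo derep.fna --nonchimeras nonchimeras.fna --chimeras chimeras.fna --sizein --xsize

The size=N annotations written by --sizeout are what --uchime_denovo uses to decide which sequences are abundant enough to be candidate parents of a chimera.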

Is it starting to become clearer? Once we establish how this process works on the combined samples, we can talk about options for going sample by sample.

Colin Brislawn 

jonsan

Apr 15, 2016, 2:19:25 PM
to Qiime 1 Forum
Hi Abhi,

I think what's happening here (re: tax assignment) is that when you pick open reference OTUs, taxonomy gets assigned anew to all the representative sequences in the input data, while in closed-ref OTU picking, you're inheriting the taxonomic assignment from the reference database (as long as you pass it with the -t flag). In the case of GreenGenes, you've got more species-level resolution when doing assignment with the full-length sequences in the reference database. The de novo assignment might get slightly different results due to the shorter input sequences. 

However, you should find that the OTUs hitting the closed-reference portion should be identical in both cases. 

Hope that helps,
-jon

Abhi genobio

Apr 20, 2016, 1:08:04 PM
to Qiime 1 Forum
I am sorry, I was away for a few days and could not respond.

Jonsan, please correct me if I am wrong. My understanding from the points above is that open-reference OTU picking gives different results (which is expected anyway) because it also does de novo OTU picking, and taxonomy is assigned to all the representative sequences in the input file (can you tell me in which step of open-reference OTU picking this happens?). The step1_otus of open-reference and closed-reference OTU picking should be identical, even though the end results are not. My confusion is this: for the species that was most abundant in the closed-reference run, and which was also identified as most abundant in the open-reference run (at step1_otus), why do we not get the same abundance count (it should be the same or more, but cannot be less)? Step 2 of the open-reference method works on the failures FASTA file and creates a representative set based on it; my understanding is that sequences already picked in step 1 are not considered for those representative sequences. At the end of the open-reference method, the OTU map is created from steps 1, 3, and 4, so at least the results of step 1 should be retained (in terms of the abundance counts of that species), which is not happening in my data.

Can you clarify this further?

Colin, as per your suggestion I will redo the analysis following the steps above and then report back.


Many Thanks,
Abhi
 

jonsan

Apr 21, 2016, 12:21:09 PM
to Qiime 1 Forum
Hi Abhi,

You're right insofar as the OTUs themselves are concerned. If you go back to look at the individual OTU identifiers, I suspect that you will find that given reference
OTUs will have the same abundance after Step 1 and in the final table, as they are constructed as you say. The abundance counts of your OTUs are retained in that final OTU table.

However, in the pick_open_reference workflow, the taxonomy assignment of all the OTUs is re-done after the total OTU table is created. This means that the taxonomy assignment of a given OTU may change from what it was listed as in the closed_reference database (from which taxonomy is inherited in Step 1). If you find the specific OTUs assigned to your taxon value that's changing (you can use the filter_taxa_from_otu_table command), I bet you will find that their taxonomy assignment has changed from Step 1 to the final OTU table.
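For example, a hedged sketch of pulling out just that genus from each table for comparison (file names are placeholders):

filter_taxa_from_otu_table.py -i otu_table_mc2_w_tax.biom -o thiomicrospira_only.biom -p g__Thiomicrospira

You could then compare the OTU IDs and their taxonomy strings between the filtered closed-reference and open-reference tables directly.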

This is all because taxonomy is reassigned to all OTUs at the completion of the open reference workflow. You can see this happening in this example:


(qiime)☕  barnacle:open_ref_94 $ biom summarize-table -i otu_table_mc2_w_tax.biom -o otu_table_mc2_w_tax.summary.txt
(qiime)☕  barnacle:open_ref_94 $ head otu_table_mc2_w_tax.summary.txt
Num samples: 1356
Num observations: 81412
Total count: 24322961
Table density (fraction of non-zero values): 0.007
....

There are a total of 81412 OTUs in the final OTU table.

Looking at the final rep_set.fna file, there are also 81412 sequences present, one for each OTU. This rep_set sequence file contains sequences from ALL of the OTU picking steps:

 barnacle:open_ref_94 $ grep -c '>' rep_set.fna
81412


And this is the file that goes into the final taxonomy assignment, and the assigned taxonomy is then written to the final OTU table:

(qiime)☕  barnacle:open_ref_94 $ tail -n 43 log_20150810110351.txt

Stdout:

Stderr:

Executing commands.

# Assign taxonomy command
parallel_assign_taxonomy_uclust.py -i ./open_ref_94/rep_set.fna -o ./open_ref_94/uclust_assigned_taxonomy -T --jobs_to_start 32
 

Stdout:

Stderr:

Executing commands.

# Add taxa to OTU table command  
biom add-metadata -i ./open_ref_94/otu_table_mc2.biom --observation-metadata-fp ./open_ref_94/uclust_assigned_taxonomy/rep_set_tax_assignments.txt -o ./open_ref_94/otu_table_mc2_w_tax.biom --sc-separated taxonomy --observation-header OTUID,taxonomy

...

Hope that makes sense!
-j







Abhi genobio

Apr 21, 2016, 2:11:53 PM
to Qiime 1 Forum
Hi Jonsan,

Your answers have cleared things up for me. Thanks so much.

I have another query related to assign_taxonomy.py. I have fungal ITS data. I did the chimera check before OTU picking using vsearch, followed by generation of a BIOM file, and then taxonomy assignment using the UNITE database.

I used the command below, but it threw an error:

assign_taxonomy.py -i otus_checked.fna -m rdp -t /sh_qiime_release_02.03.2015/developer/sh_taxonomy_qiime_ver7_dynamic_02.03.2015_dev.txt -r /sh_qiime_release_02.03.2015/developer/sh_refs_qiime_ver7_97_02.03.2015_dev.fasta


Traceback (most recent call last):
  File "/usr/local/bin/assign_taxonomy.py", line 417, in <module>
    main()
  File "/usr/local/bin/assign_taxonomy.py", line 394, in main
    log_path=log_path)
  File "/usr/local/lib/python2.7/dist-packages/qiime/assign_taxonomy.py", line 859, in __call__
    max_memory=max_memory, tmp_dir=tmp_dir)
  File "/usr/local/lib/python2.7/dist-packages/bfillings/rdp_classifier.py", line 515, in train_rdp_classifier_and_assign_taxonomy
    tmp_dir=tmp_dir)
  File "/usr/local/lib/python2.7/dist-packages/bfillings/rdp_classifier.py", line 485, in train_rdp_classifier
    return app(training_seqs_file)
  File "/usr/local/lib/python2.7/dist-packages/bfillings/rdp_classifier.py", line 327, in __call__
    remove_tmp=remove_tmp)
  File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 284, in __call__
    'StdErr:\n%s\n' % open(errfile).read())
burrito.util.ApplicationError: Unacceptable application exit status: 1
Command:
cd "/Analysis/"; java -Xmx4000M -cp "/qiime_software/rdpclassifier-2.2-release/rdp_classifier-2.2.jar" edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker "/home/qiime/temp/RdpTaxonomy_Deg1SN.txt" "/home/qiime/temp/tmpSPdiW0V32JAS2EZz3wjr.txt" 1 version1 cogent "/home/qiime/temp/RdpTrainer_NYKa1l" > "/home/qiime/temp/tmpURKGkDD9jpwws1eVjgO5.txt" 2> "/home/qiime/temp/tmpD5QI88uc2yzEv6oiK0Ji.txt"
StdOut:


Exception in thread "main" java.lang.IllegalArgumentException: 
There is no node in GENUS level!
    at edu.msu.cme.rdp.classifier.train.TreeFactory.createGenusWordConditionalProb(TreeFactory.java:263)
    at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:50)
    at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:133)

I referred to the steps mentioned in this link https://groups.google.com/forum/#!msg/qiime-forum/xpD2KzmGXZo/KuQF-VaoKW4J but was still not successful.

I have the QIIME virtual machine installed on a machine with 16 GB RAM, so I don't think it is a configuration problem. I have run the QIIME open-reference OTU picking script with the taxonomy information set in the parameters file and never faced such a problem, but when executing this script separately the error is raised.

Can you please guide me on how to rectify this error?

Thanks so much for being so patient and clarifying all my queries.

Many thanks,
Abhi

TonyWalters

Apr 21, 2016, 2:21:54 PM
to Qiime 1 Forum
Abhi, that isn't the error that usually comes up for RAM issues, but you could still try increasing the memory allocation with the --rdp_max_memory parameter.

I've noticed in some cases with the UNITE database (and other custom databases) there are non-ASCII characters, and occasionally * characters in the taxonomy file which interfere with the RDP classifier. Can you try running the custom script, linked here (https://gist.github.com/walterst/0a4d36dbb20c54eeb952) to create a cleaned version of the taxonomy mapping file, and use that file as the -t input for assign_taxonomy.py?


Abhi genobio

May 10, 2016, 7:09:51 AM
to Qiime 1 Forum
Hi Colin,

As per your suggestion, I executed the following steps.

Chimera check before OTU picking:
vsearch --derep_fulllength --sizeout (dereplication step)
vsearch --uchime_denovo --sizein --sizeout (this step gives the non-chimeric sequences; I think we need to pass --nonchimeras) (chimera check)
vsearch --cluster_size --sizein --xsize (clustering step; we also have to pass --centroids and --id). For example, let the output be a file otu.fna

1. Should I now use the otu.fna file as the input for closed-reference OTU picking in QIIME? Or
2. Are the above steps used to generate a de novo reference, so that combined_seqs.fna can be aligned against it as shown below?
vsearch -usearch_global combined_seq.fna -db otu_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc

Please clarify.

You also mentioned chimera checking after OTU picking.
In that case, which file should I use as input? I performed closed-reference OTU picking and the result was a BIOM file.
How can I use a BIOM file with vsearch to run the commands below? Or are there other OTU picking steps that I missed?
Or do I have to pick OTUs with vsearch? If so, what command should I use, what will be my input, and what should be used as the reference for vsearch?
 
Chimera check after OTU picking,
vsearch --derep_fulllength --sizeout (optional singleton removal) 
vsearch --cluster_size --sizein --sizeout
vsearch --uchime_denovo --sizein --xsize


Please clarify my doubts.

Colin Brislawn

May 10, 2016, 12:29:14 PM
to Qiime 1 Forum
Hello Abhi,

I'll respond in line. 

On Tuesday, May 10, 2016 at 4:09:51 AM UTC-7, Abhi genobio wrote:
Hi Colin,

As per your suggestion, I executed the following steps.

Chimera check before OTU picking:
vsearch --derep_fulllength --sizeout (dereplication step)
vsearch --uchime_denovo --sizein --sizeout (this step gives the non-chimeric sequences; I think we need to pass --nonchimeras) (chimera check)
vsearch --cluster_size --sizein --xsize (clustering step; we also have to pass --centroids and --id). For example, let the output be a file otu.fna
Looks good to me! 
 
1. Should I now use the otu.fna file as the input for closed-reference OTU picking in QIIME? Or
2. Are the above steps used to generate a de novo reference, so that combined_seqs.fna can be aligned against it as shown below?
vsearch -usearch_global combined_seq.fna -db otu_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc

Number 2! The file otu.fna already holds your OTU centroids (you are done with OTU picking!).
That command looks good. Once you run it (add some more threads!) you can convert otu_table_mapping.uc into a .biom file. 
 
Please clarify.

You also mentioned chimera checking after OTU picking.
In the above example, you chimera check before, so check after is probably not needed. Just to be clear.  
In that case, which file should I use as input? I performed closed-reference OTU picking and the result was a BIOM file.
You could use the otus.fna file produced by clustering (the previous step).  
How can I use a BIOM file with vsearch to run the commands below? Or are there other OTU picking steps that I missed?
Or do I have to pick OTUs with vsearch? If so, what command should I use, what will be my input, and what should be used as the reference for vsearch?
Vsearch does not yet produce a .biom file directly. Instead it gives you the centroids of your OTUs, and the tools you need to find the number of times each OTU appears in each sample. This is done with the --usearch_global command mentioned above. After you use --usearch_global, you will make a .biom table and go back to using the default qiime scripts.
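One way to do that conversion, assuming a reasonably recent biom-format package that provides the from-uc subcommand (file names are placeholders):

biom from-uc -i otu_table_mapping.uc -o otu_table_from_uc.biom

If your biom version lacks from-uc, the .uc hits can also be tallied into a tab-delimited OTU table with a short script and then converted with biom convert.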

Abhi genobio

May 11, 2016, 9:43:16 AM
to Qiime 1 Forum
Hi Colin,

Thanks for clarifying my doubts.
I have a few more queries.
My query fasta has 0.1M reads.
With respect to chimera checking before OTU picking, I executed the steps below:
1. vsearch --derep_fulllength --sizeout (dereplication step)
2. vsearch --uchime_denovo --sizein --sizeout
Can the output of step 2 be carried forward for analysis in QIIME?
My worry is that the output of step 1 is a dereplicated file that carries the abundance information. For example, I get 14,000 dereplicated sequences from 0.1M reads, and at step 2 we perform the chimera check. If the resultant file of step 2 is then used for OTU picking in QIIME (uclust_ref), it will cluster again and give abundance information with respect to the 14,000 sequences. Is my understanding of this right or not?

In short, I want to use vsearch for dereplication and chimera checking before OTU picking, and then use the resulting file for OTU picking in QIIME. Please tell me whether this is possible. If yes, what is the workflow? Is what I described correct, or does it have to be done differently?

Thanks
Abhi

Colin Brislawn

May 11, 2016, 11:44:53 AM
to Qiime 1 Forum
Let's start here:
In short, I want to use vsearch for dereplication and chimera checking before OTU picking, and then use the resulting file for OTU picking in QIIME. Please tell me whether this is possible. If yes, what is the workflow? Is what I described correct, or does it have to be done differently?
So this is a bit tricky. When qiime does OTU picking, it also builds an abundance table using the original input reads. I've been doing this whole process outside of qiime using vsearch. Trying to do these steps separately could be hard. 

I want to use vsearch for dereplication and chimera checking before OTU picking, and then use the resulting file for OTU picking in QIIME
That should be possible, but I've never done it before. I'm not sure I should be giving advice on things I know so little about... 

Maybe a qiime dev can comment more.

Colin

Abhi genobio

May 11, 2016, 2:17:53 PM
to Qiime 1 Forum
Hi Colin,
Thanks for being so kind in answering my queries. I have a small query related to the closed-reference OTU picking method: can I use a lower similarity value (e.g. 0.85 or 0.90) when doing closed-reference OTU picking with 97_otus.fna as my reference sequences and its corresponding taxonomy using the uclust method? Or should the similarity always be 0.97 to match the 97_otus.fna file?
Thanks

Colin Brislawn

May 11, 2016, 3:03:18 PM
to Qiime 1 Forum
Hello Abhi,

For closed-ref OTU picking, the convention is to use the same similarity value for your pre-clustered database and your read clustering (use .97 for both).
(While you could use any other combination and report these settings in the methods, I think that using standard methods makes it easier for reviewers and readers to understand and compare your results. If you choose to use non-default methods, you have to show why your new methods are better.) 

corresponding taxonomy using uclust method
In closed-ref OTU picking, the OTUs already have taxonomies assigned to them, and those taxonomies are just copied over directly. There is no phase of assigning taxonomy with uclust like there is for open-ref and de novo methods.
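For reference, a minimal QIIME 1 closed-reference run at the conventional settings would look roughly like this (paths are placeholders):

pick_closed_reference_otus.py -i combined_seqs.fna -r 97_otus.fasta -t 97_otu_taxonomy.txt -o closed_ref_97/

The similarity defaults to 0.97; if you did want to change it, it would go in a parameters file as pick_otus:similarity and be passed with -p.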

I hope this discussion has been helpful. OTU picking is one of the most important parts of the qiime pipeline and I appreciate your interest in learning more about it. 

Colin

Abhi genobio

May 12, 2016, 9:37:43 AM
to Qiime 1 Forum
Hi Colin,

I have one more doubt about vsearch. I have a combined.fasta file, and I followed these steps:

1. vsearch --derep_fulllength combined.fna --output combined_derep.fasta --log=log --sizeout --minuniquesize 2 (dereplication)

2. vsearch --uchime_denovo combined_derep.fasta --nonchimeras non_Chimera.fna --sizein --xsize --chimeras chimeras.fasta (chimera detection)

3. vsearch -usearch_global combined.fna -db non_Chimera.fna -strand plus -id 0.97 -uc otu_table_mapping.uc (OTU picking)

My confusion is this: for chimera detection, is it always the case that the reference database must be chimera-free, or is it the query that should be chimera-free?
 
For example, in closed-reference OTU picking with the Greengenes database we don't perform a chimera check because the reference is already chimera-checked.
Much the same way, in vsearch step 2 a chimera-free file is generated with the de novo method, and that file is used as the reference in step 3.
In step 3 we again use combined.fna as the input file, which still contains chimeric sequences.

So, is it always the case that the reference database should be chimera-free?
 
Could the database I use in step 3 instead be from Greengenes or SILVA, with my input sequences being chimera-free?

e.g.: vsearch -usearch_global non_Chimera.fna -db 97_otus.fasta -strand plus -id 0.97 -uc otu_table_mapping.uc

Here non_Chimera.fna contains the representative sequences of my initial combined.fna file, now chimera-free, and my reference is the 97%-clustered 16S sequences. How do I get the abundance count information since my query sequences are clustered?

I am getting confused, with so many questions coming to mind.

Hope you understand my confusion. It will indeed be great if you can clarify.

Thanks
Abhi





Colin Brislawn

May 12, 2016, 1:10:34 PM
to Qiime 1 Forum
Hello Abhi,

You missed a step!

1. vsearch --derep_fulllength combined.fna --output combined_derep.fasta --log=log --sizeout --minuniquesize 2 (dereplication)
2. vsearch --uchime_denovo combined_derep.fasta --nonchimeras non_Chimera.fna --sizein --xsize --chimeras chimeras.fasta (chimera detection)
CLUSTERING: vsearch --cluster_smallmem non_Chimera.fna --centroids otus.fna --id .97 (otu picking)
3. vsearch -usearch_global combined.fna -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc (biom table creation)


So, is it always the case that the reference database should be chimera-free?
Yes. If the database is chimera free, you don't have to worry about chimeras in the input data set.

Could the database I use in step 3 instead be from Greengenes or SILVA, with my input sequences being chimera-free?
You can use Greengenes in step three. That would make it the same as 'closed-ref' OTU picking. The process as I described it is de novo OTU picking, because we use the otus.fna created from our own reads instead of the Greengenes OTUs.


How do I get the abundance count information since my query sequences are clustered?
You convert the otu_table_mapping.uc to a .biom file. I can tell you how to do this whenever you are ready. :-) 

Colin 



Abhi genobio

May 15, 2016, 1:25:53 PM
to Qiime 1 Forum
Hi Colin,

I successfully executed the commands below.

1. vsearch --derep_fulllength combined.fna --output combined_derep.fasta --log=log --sizeout --minuniquesize 2 (dereplication)

2. vsearch --uchime_denovo combined_derep.fasta --nonchimeras non_Chimera.fna --sizein --xsize --chimeras chimeras.fasta (Chimera detection)

In the clustering step, I executed:
3. vsearch --cluster_smallmem non_chimechecked.fna --centroids otus.fna --id .97
and got this result:
Reading file non_chimechecked.fna 100%
43121050 nt in 91635 seqs, min 242, max 534, avg 471
Masking 100%
Counting unique k-mers 100%
Clustering 0%
None of my sequences were clustered and otus.fna was an empty file; the error said I needed to use --usersort.

I modified the command to use --usersort, and it was successful:
3(modified) vsearch --cluster_smallmem non_chimechecked.fna --centroids otus_cent.fna --id .97 --usersort

Reading file non_chimechecked.fna 100% 
43121050 nt in 91635 seqs, min 242, max 534, avg 471
Masking 100% 
Counting unique k-mers 100% 
Clustering 100% 
Sorting clusters 100%
Writing clusters 100% 
Clusters: 1669 Size min 1, max 12709, avg 54.9
Singletons: 433, 0.5% of seqs, 25.9% of clusters

Then I executed
4. vsearch -usearch_global combined.fna -db otus_cent.fna -strand plus -id 0.97 -uc otu_table_mapping.uc 

Reading file otus_cent.fna 100% 
756359 nt in 1669 seqs, min 363, max 531, avg 453
Masking 100% 
Counting unique k-mers 100% 
Creating index of unique k-mers 100% 
Searching 100% 
Matching query sequences: 2131250 of 2644969 (80.58%)

I have done these steps. Please let me know why --usersort is needed. How do I proceed to get the abundance information?

I also did a reference-based run, like closed-reference OTU picking, using vsearch, since it is very fast.

I executed the same steps up to step 3, and in step 4 used the following files as input:

vsearch -usearch_global otus_cent.fna -db  91_otus.fasta -strand plus -id 0.91 -uc otu_table_mapping_91.uc 

I would like to know whether this step is correct or whether I have to modify step 3.
Earlier you mentioned that step 3 is the OTU picking step and step 4 generates the .uc file that can be converted to a BIOM file. I am slightly confused because I pass the -db parameter in step 4.
Does step 4 perform a global alignment of the query against the reference database provided?

Adding to my confusion, when I converted the otu_table_mapping_91.uc file to a tab-delimited file, the abundance values were very low.
I understand that the query here was the (clustered) representative sequences obtained from my initial combined_seqs.fna. So how do I recover the original abundance values from this output?

Hope you understand my queries. It will be great if you kindly guide me on how to proceed further.

Many Thanks
Abhi

Colin Brislawn

May 15, 2016, 8:39:50 PM
to Qiime 1 Forum
Great work Abhi! 

About this command:
3(modified) vsearch --cluster_smallmem non_chimechecked.fna --centroids otus_cent.fna --id .97 --usersort 
I completely forgot about --usersort (which overrides the check for sorting by size). Good catch.

About closed-ref OTU picking:
I also did a reference-based run, like closed-reference OTU picking, using vsearch, since it is very fast.
I executed the same steps up to step 3, and in step 4 used the following files as input:
vsearch -usearch_global otus_cent.fna -db  91_otus.fasta -strand plus -id 0.91 -uc otu_table_mapping_91.uc  
That is not how I go about closed-ref OTU picking... In closed-ref OTU picking as implemented in qiime, you map your reads directly to the database. Something like this:
vsearch --usearch_global combined.fna --db gg_97.fasta -id 0.97 
This is an important distinction: closed-ref OTU picking does not actually involve picking OTUs! In closed-ref OTU picking, the OTUs are already listed in the database, and you just map your reads to them so you see how many times they appear in your samples. 
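Spelled out a bit more fully (a sketch, with the Greengenes file name as a placeholder), that mapping step would be something like:

vsearch --usearch_global combined.fna --db 97_otus.fasta --id 0.97 --strand plus --uc closed_ref_mapping.uc --threads 4

and the resulting .uc file is what gets turned into the OTU (.biom) table.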


Does step 4 perform a global alignment of the query against the reference database provided?
Yes. That's why it's called --usearch_global :-)

Adding to my confusion, when I converted the otu_table_mapping_91.uc file to a tab-delimited file, the abundance values were very low.
I understand that the query here was the (clustered) representative sequences obtained from my initial combined_seqs.fna. So how do I recover the original abundance values from this output?
Like I mentioned, you should perform closed-ref OTU picking using your combined.fna file and greengenes as the database. (You never use otus_cent.fna.) 


I just reread this thread, and realized that we have basically been discussing the OTU clustering strategy of Robert Edgar. I feel that his description of these steps and why they work well would be far more helpful than mine. Take a look at this page and read over the steps. His pipeline is slightly different from what we have been discussing, but it is one of the reference points in this field. http://www.drive5.com/usearch/manual/uparse_pipeline.html


Colin 