long duration for denovo chimera detection

Naina kumari

Jun 18, 2019, 3:25:12 AM

to VSEARCH Forum

Dear vsearch developers,

I performed de novo chimera checking with vsearch on 16S data. The input FASTA file is around 9 GB, and I used the following command:

$ parallel --eta -j 30 'vsearch --uchime_ref {} --db /home/bmgu/Desktop/skin/gold.fa --chimeras chimeras_ref.fasta --nonchimeras non_chimeras_ref.fasta --uchimeout chimera_ref.uc --relabel_keep' ::: multiplexed_linearized_dereplicated.fasta

The command ran for a week and only 20% of the output was generated; it seems it will keep running endlessly. Please advise whether I am using the right command. How much time does vsearch take to complete de novo chimera detection? Please help, as I am stuck at this step because of the large sample size.

Thanking you

Naina Kumari

Torbjørn Rognes

Jun 18, 2019, 9:17:45 AM

to VSEARCH Forum

Hi Naina Kumari

If you want to perform de novo chimera detection you should use the vsearch command uchime_denovo, not uchime_ref. In my experience de novo chimera detection is generally more appropriate than reference based detection, but it may depend on the reference database used, and on the data used.

However, the uchime_ref command in vsearch is faster and handles multithreading better than uchime_denovo.

You should specify the number of threads to run in vsearch with the "--threads" option. If you do not specify it, it will by default use all the cores (threads) available in the computer.

I am not very familiar with the (gnu?) parallel command, but I doubt it is used correctly here. To me it appears that you are starting 30 jobs running vsearch on the same dataset. In addition, within each job vsearch will start as many threads as there are cores in your machine. This is likely to be very inefficient.

I would advise you to try just this command instead (and not use parallel):

vsearch --uchime_ref multiplexed_linearized_dereplicated.fasta --db /home/bmgu/Desktop/skin/gold.fa --chimeras chimeras_ref.fasta --nonchimeras non_chimeras_ref.fasta --uchimeout chimera_ref.uc --relabel_keep

It may take several days, but you should be able to see the progress. You could also look at the size of the output files from another terminal.
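
As a minimal sketch of the suggestion above, here is the same run with an explicit thread count. The value 8 for --threads is only an assumption; pick whatever fits your machine. The script skips the run if vsearch or the input files are not found.

```shell
#!/bin/sh
# Sketch: one vsearch process with an explicit thread count, no GNU parallel.
THREADS=8   # assumption: set this to the number of cores you want to use
INPUT=multiplexed_linearized_dereplicated.fasta
DB=/home/bmgu/Desktop/skin/gold.fa

run_uchime_ref() {
    # $1 = query FASTA, $2 = reference database
    vsearch --uchime_ref "$1" --db "$2" --threads "$THREADS" \
        --chimeras chimeras_ref.fasta \
        --nonchimeras non_chimeras_ref.fasta \
        --uchimeout chimera_ref.uc --relabel_keep
}

if command -v vsearch >/dev/null 2>&1 && [ -f "$INPUT" ] && [ -f "$DB" ]; then
    run_uchime_ref "$INPUT" "$DB"
else
    echo "vsearch or input files not found; nothing was run" >&2
fi

# From another terminal, while the job is running:
#   ls -lh chimeras_ref.fasta non_chimeras_ref.fasta chimera_ref.uc
```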

I hope this helps.

Best wishes,

- Torbjørn

Naina kumari

Jun 19, 2019, 2:32:00 AM

to VSEARCH Forum

Hi Torbjørn Rognes

First of all, thank you for replying, and I am extremely sorry: I wrote the wrong command in my earlier post. The command I posted earlier is the one I used for reference-based chimera detection, and it ran very efficiently; in fact, it completed in less than an hour. I then ran de novo chimera checking on the non-chimeras FASTA file obtained from reference-based chimera detection, and the command I used was:

parallel --eta -j 30 'vsearch --uchime_denovo {} --chimeras chimeras_denovo.fasta --nonchimeras non_chimeras_denovo.fasta --uchimeout chimera_denovo.uc --relabel_keep' ::: non_chimeras_ref.fasta

The vsearch manual says that de novo chimera detection does not support multithreading, so the threads option cannot be used here. Because I wanted to run the command on multiple cores explicitly, I opted for GNU parallel, since it had worked well with reference-based chimera detection. But the de novo command is still taking a lot of time. Is that because, even though I gave it 30 cores, it is still running on a single core? Do you have any idea how long de novo chimera detection actually takes with vsearch without parallel (file size > 10 GB)? At this speed I think the command would run for a month, and I have lots of samples to be analysed together (> 200).

Thanking you

Naina Kumari

Torbjørn Rognes

Jun 19, 2019, 3:20:52 AM

to VSEARCH Forum

Hi

If you want to use the uchime_denovo command, you cannot use multiple threads with the "parallel" command as you indicate; you'll have to run it like this:

vsearch --uchime_denovo non_chimeras_ref.fasta --chimeras chimeras_denovo.fasta --nonchimeras non_chimeras_denovo.fasta --uchimeout chimera_denovo.uc --relabel_keep

It may take days, though.
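
On the GNU parallel question: as far as I understand it, parallel starts one job per input argument given after ":::", and -j only caps how many of those jobs run at once. With a single FASTA file there is a single job, hence a single vsearch process. A small sketch using --dry-run, which prints the commands without executing them (the a/b/c file names and the output naming are placeholders, not files from this thread):

```shell
#!/bin/sh
# List the commands GNU parallel would start, without running them.
show_jobs() {
    parallel --dry-run -j 30 \
        'vsearch --uchime_denovo {} --nonchimeras {.}.nonchimeras.fasta' ::: "$@"
}

if command -v parallel >/dev/null 2>&1; then
    show_jobs non_chimeras_ref.fasta     # one input file: one command
    show_jobs a.fasta b.fasta c.fasta    # three input files: three commands
fi
```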

Have you dereplicated and clustered your sequences in advance? It is usually sufficient to run chimera checking on the OTU representative sequences, not the entire dataset.
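
To get a feel for how much smaller the input becomes, one can count FASTA records before and after clustering. The sketch below only counts header lines; otus.fasta is a hypothetical name for a file of OTU representative sequences.

```shell
#!/bin/sh
# Count FASTA records by counting header lines (those starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

for f in non_chimeras_ref.fasta otus.fasta; do   # otus.fasta: hypothetical name
    if [ -f "$f" ]; then
        printf '%s\t%s sequences\n' "$f" "$(count_seqs "$f")"
    fi
done
```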

- Torbjørn

Naina kumari

Jun 19, 2019, 5:21:04 AM

to VSEARCH Forum

Dear Dr. Torbjørn,

Thanks for your prompt reply.

We are following the QIIME and UPARSE pipelines in our lab, where OTU binning is performed after QA/QC and chimera detection. Until now, for smaller numbers of samples (20-30 individuals), usearch61 was used to perform chimera detection, which took 2-3 hours on a single-core processor. However, usearch61 was unable to perform chimera detection for files larger than 4 GB, so we started using vsearch. Right now we have 100 samples in batches for multiple projects, and chimera detection has become a bottleneck in the analysis. You suggested performing OTU clustering before chimera detection, both reference-based and de novo. Will it not run the risk of underestimation of Chimeric reads and affect the diversity estimates like shannon and chao indices when performed only on Representative sequences? Can you forward some papers where you have successfully used this method so that I can discuss with my lab members and PI for justifying our deviation from QIIME pipeline and following the new method as proposed by you? This would be of immense help.

Thank you in advance. Looking forward to your reply.

-Naina

Colin Brislawn

Jun 19, 2019, 12:35:00 PM

to VSEARCH Forum

Hello Naina,

I can help with this part:

> Can you forward some papers where you have successfully used this method so that I can discuss with my lab members and PI for justifying our deviation from QIIME pipeline and following the new method as proposed by you?

Spatio-temporal microbial community dynamics within soil aggregates

https://www.sciencedirect.com/science/article/pii/S0038071719300252

Selection, Succession, and Stabilization of Soil Microbial Consortia
https://msystems.asm.org/content/4/4/e00055-19/

Our order was filter reads, dereplicate, cluster_size, uchime_denovo, uchime_ref.

So why does this method work? Let's compare the two orders:

Original order: dereplicate -> uchime_denovo -> uchime_ref -> cluster_size
Updated order: dereplicate -> cluster_size -> uchime_denovo -> uchime_ref

Note that the steps are the same and the output is the same, and only their order changes. So in the original order, reads are checked for chimeras, while in the updated order, OTUs are checked for chimeras.
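
Here is a rough sketch of the updated order as vsearch commands. This is not our exact script: the file names, the thread count and the 0.97 identity threshold are assumptions, and the read filtering step is left out.

```shell
#!/bin/sh
# Sketch of the updated order:
#   dereplicate -> cluster_size -> uchime_denovo -> uchime_ref
# File names, THREADS and the identity threshold ID are assumptions.
THREADS=8
ID=0.97

pipeline() {
    # $1 = quality-filtered reads (FASTA), $2 = reference database (FASTA)
    vsearch --derep_fulllength "$1" --sizeout --output derep.fasta &&
    vsearch --cluster_size derep.fasta --id "$ID" --threads "$THREADS" \
        --sizein --sizeout --centroids otus.fasta &&
    vsearch --uchime_denovo otus.fasta --sizein --sizeout \
        --nonchimeras otus_denovo.fasta &&
    vsearch --uchime_ref otus_denovo.fasta --db "$2" --threads "$THREADS" \
        --nonchimeras otus_final.fasta
}

# Run only when vsearch and the (hypothetical) input files are present.
if command -v vsearch >/dev/null 2>&1 && [ -f filtered.fasta ] && [ -f gold.fa ]; then
    pipeline filtered.fasta gold.fa
fi
```

The point is that uchime_denovo, the slow single-threaded step, only sees the OTU centroids rather than every dereplicated read.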

To answer your questions:

> Will it not run the risk of underestimation of Chimeric reads

Yes, this is a valid risk. However, we find this does not happen most of the time.

> effect the diversity estimates like shannon and chao indices when performed only on Representative sequences?

Nope. Both these methods make representative sequences (OTUs), so the output should be very similar or identical.

Let me know if that helps!

Colin

Naina kumari

Jun 20, 2019, 7:12:25 AM

to VSEARCH Forum

Hello Colin,

Thank you for your reply; I am very grateful for your help. These papers will help me in my analysis, and I will try to follow the new pipeline suggested by you and Dr. T. Rognes, which is: dereplicate -> cluster_size -> uchime_denovo -> uchime_ref

I will let you know if this pipeline works successfully for the analysis of my data.