Understanding counts from vsearch logs - numbers don't add up for chimera abundance


Sanjeev Sariya

Sep 22, 2016, 6:10:33 PM
to VSEARCH Forum
Hi Team,

Thank you for the excellent tool and support. I'm using


vsearch v2.0.3_osx_x86_64, 64.0GB RAM, 24 cores


I can't make sense of the numbers reported at the end of the chimera log.


Input sequences for derep step: 2286032


vsearch --derep_full full_tagclean.fasta  --output full_derep.fasta --log=vsearch_log --sizeout --minuniquesize 2 2>derep_log.txt


The sequence count after dereplication is 113957.


I cluster these at 97% identity:

vsearch -cluster_fast full_derep.fasta -id 0.97 --sizein --sizeout --relabel OTU_  --centroids otus.fna 2> cluster_log.txt


Sequence count of otus.fna: 6897


I run chimera check on these:


vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --chimeras chimeras.fasta 2> chimera_log.txt


Sequence count of otus_checked.fna: 1817


And finally:


vsearch -usearch_global full_tagclean.fasta -db otus_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc 2> usearch_global_log.txt


Chimera log says:


Reading file otus.fna 100%
2864388 nt in 6897 seqs, min 254, max 464, avg 415
Masking 100%
Sorting by abundance 100%
Counting unique k-mers 100%
Detecting chimeras 100%
Found 5074 (73.6%) chimeras, 1817 (26.3%) non-chimeras,
and 6 (0.1%) borderline sequences in 6897 unique sequences.
Taking abundance information into account, this corresponds to
105518 (6.2%) chimeras, 1589751 (93.7%) non-chimeras,
and 774 (0.0%) borderline sequences in 1696043 total sequences.


I don't understand how 1696043 can be the total. My input to the dereplication step had more sequences than this, and my sequence count after dereplication is different as well.


From which step does vsearch take the abundance information?


Any help would be highly appreciated.

Thanks,

Sanjeev




Torbjørn Rognes

Sep 22, 2016, 6:23:03 PM
to VSEARCH Forum
Hi

As far as I can see from your message, it seems like some sequences might have been removed by the "--minuniquesize 2" option in the initial dereplication. Probably 2286032-1696043=589989 reads (singletons). Could that be the reason for the "lost" reads?

- Torbjørn
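
The arithmetic in this reply can be checked directly from the numbers quoted so far. The following is only a sketch: the variable names are made up for illustration, and the values are copied from the post and the chimera log above.

```shell
# Numbers copied from the thread; variable names are illustrative only.
input_reads=2286032     # reads given to the dereplication step
chimera_total=1696043   # "total sequences" reported in the chimera log

# Reads that are no longer accounted for after dereplication.
missing_reads=$((input_reads - chimera_total))
echo "reads missing from the chimera total: $missing_reads"

# The three abundance classes in the chimera log should sum to that total.
class_sum=$((105518 + 1589751 + 774))
echo "chimeras + non-chimeras + borderline: $class_sum"
```

If the singleton explanation holds, the first number should match the count of clusters discarded by "--minuniquesize 2" in the dereplication log.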

Sanjeev Sariya

Sep 22, 2016, 6:30:14 PM
to VSEARCH Forum
Hi Torbjørn,

Thank you for your prompt reply.
This is embarrassing on my end. You're right.

Reading file full_tagclean.fasta 100%
946256408 nt in 2286032 seqs, min 201, max 490, avg 414
Dereplicating 100%
Sorting 100%
703946 unique sequences, avg cluster 3.2, median 1, max 78547
Writing output file 100%
113957 uniques written, 589989 clusters discarded (83.8%)


I have an off-topic question about "borderline" sequences. Why are they not discarded if the tool is uncertain whether they are chimeric?


Thank you for your time. :)
---
Sanjeev
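
The dereplication log in this reply can be cross-checked the same way. This is a rough sketch with illustrative variable names. It relies on the fact that, with "--minuniquesize 2", every discarded cluster has abundance 1, so each discarded cluster corresponds to exactly one read.

```shell
# Numbers copied from the dereplication log; variable names are illustrative only.
input_reads=2286032      # sequences read from full_tagclean.fasta
unique_total=703946      # unique sequences found
uniques_written=113957   # uniques with abundance >= 2

# Clusters dropped by --minuniquesize 2 (all singletons, one read each).
discarded=$((unique_total - uniques_written))

# Reads still represented in the dereplicated file via the size= annotations.
retained_reads=$((input_reads - discarded))

# Share of unique sequences that were discarded, as printed in the log.
pct=$(awk -v d="$discarded" -v u="$unique_total" 'BEGIN { printf "%.1f", 100 * d / u }')

echo "discarded clusters: $discarded ($pct%)"
echo "reads retained:     $retained_reads"
```

The retained read count is the figure that the later steps carry forward through "--sizein" and "--sizeout".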

Frédéric Mahé

Sep 23, 2016, 3:57:42 AM
to vsearc...@googlegroups.com
Hi,


> I've an off the topic query on "border line" sequences. Why are these not discarded if tool is uncertain about them being chimeric?


Borderline sequences are kept because they are too close to one of their parents for the tool to be sure they are chimeras. If you want to discard them, the first step is to list them with the output option --borderline.

Torbjørn Rognes

Sep 23, 2016, 4:12:48 AM
to VSEARCH Forum
The borderline sequences are excluded. They are not included in the file specified with "--nonchimeras". You can output them with the "--borderline" option if you are interested. In the file specified with "--uchimeout" they are indicated with a question mark (?) in the last column, while chimeras are indicated with a "Y" and non-chimeras with an "N".

- Torbjørn
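
One way to inspect those flags is to tally the last column of the "--uchimeout" file. The sketch below does not run vsearch. It builds a small toy file with hypothetical rows and fewer columns than real output, assuming only that the file is tab-separated and that the flag is in the last column, as described above. Treating column 2 as the sequence label is an assumption of this toy file.

```shell
# Toy stand-in for a --uchimeout file: hypothetical rows, not real vsearch output.
tmp=$(mktemp -d)
printf '0.00\tOTU_1;size=900\tN\n'  > "$tmp/toy_uchimeout.tsv"
printf '1.25\tOTU_2;size=40\tY\n'  >> "$tmp/toy_uchimeout.tsv"
printf '0.29\tOTU_3;size=12\t?\n'  >> "$tmp/toy_uchimeout.tsv"
printf '0.00\tOTU_4;size=7\tN\n'   >> "$tmp/toy_uchimeout.tsv"

# Tally the classification flag found in the last tab-separated column.
counts=$(awk -F'\t' '{ n[$NF]++ }
  END { printf "Y=%d N=%d ?=%d", n["Y"], n["N"], n["?"] }' "$tmp/toy_uchimeout.tsv")
echo "$counts"

# Print the labels (column 2 in this toy file) of the borderline rows.
borderline=$(awk -F'\t' '$NF == "?" { print $2 }' "$tmp/toy_uchimeout.tsv")
echo "$borderline"
```

On a real run, the three tallies should match the chimera, non-chimera, and borderline counts printed in the chimera log.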

Sanjeev Sariya

Oct 25, 2016, 10:44:19 AM
to VSEARCH Forum
Dear Torbjørn,

Thank you for clarifying that "borderline" sequences are excluded from the "--nonchimeras" output file. That is a relief; otherwise I would have had to rerun everything from the chimera check onward.

I have a follow-up question:

I used the "--borderline" flag to write them to a separate file.
If I wish to include them with the nonchimeras, how do I incorporate them into the final "otu_table_mapping.uc" file, if that is possible?

Thank you again for your tremendous help.
--Sanjeev

Torbjørn Rognes

Oct 26, 2016, 7:29:55 AM
to VSEARCH Forum
Hi!

I would advise against including the borderline sequences, but here is how you can do it anyway.

If you use a command like this to perform chimera detection:

vsearch --uchime_denovo otus.fna --sizein --nonchimeras nonchimeras.fasta --borderline borderline.fasta

Then you should be able to simply add them together like this:

cat nonchimeras.fasta borderline.fasta > accepted.fasta

You can then map all the reads to the accepted OTUs like this:

vsearch -usearch_global full_tagclean.fasta -db accepted.fasta -strand plus -id 0.97 -uc otu_table_mapping.uc

- Torbjørn
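
A quick sanity check after the "cat" step is to count FASTA headers before and after merging. The sketch below uses tiny made-up FASTA files with placeholder sequences in place of the real nonchimeras.fasta and borderline.fasta. With the numbers from this thread, the merged file would be expected to hold 1817 + 6 = 1823 sequences.

```shell
# Toy FASTA files (hypothetical contents) standing in for the real outputs.
tmp=$(mktemp -d)
printf '>OTU_1;size=900\nACGTACGT\n>OTU_4;size=7\nACGTTTGT\n' > "$tmp/nonchimeras.fasta"
printf '>OTU_3;size=12\nACGAACGT\n' > "$tmp/borderline.fasta"

# Merge the two sets exactly as in the post above.
cat "$tmp/nonchimeras.fasta" "$tmp/borderline.fasta" > "$tmp/accepted.fasta"

# Count header lines; the merged count should equal the sum of the parts.
n_non=$(grep -c '^>' "$tmp/nonchimeras.fasta")
n_bor=$(grep -c '^>' "$tmp/borderline.fasta")
n_acc=$(grep -c '^>' "$tmp/accepted.fasta")
echo "nonchimeras=$n_non borderline=$n_bor accepted=$n_acc"
```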

Sanjeev Sariya

Oct 26, 2016, 9:48:24 AM
to VSEARCH Forum
Hi Torbjørn,

Yes, adding the borderline sequences would definitely be sketchy. I wanted to know how to experiment with them nonetheless and see what they get classified as.
Thanks for your extensive help. 

Cheers,
Sanjeev
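
For this kind of experiment, one possible approach is to tally how many reads in the mapping file hit each OTU and mark which of those OTUs came from the borderline file. The sketch below uses hypothetical toy files. It assumes the usual tab-separated .uc layout, where column 1 is the record type ("H" for a hit, "N" for no hit), column 9 the query label, and column 10 the target label. It also assumes the FASTA headers contain no spaces, so that they match the target labels exactly.

```shell
# Toy inputs (hypothetical rows), not real vsearch output.
tmp=$(mktemp -d)
printf '>OTU_3;size=12\nACGAACGT\n' > "$tmp/borderline.fasta"
printf 'H\t0\t250\t99.0\t+\t0\t0\t250M\tread_1\tOTU_1;size=900\n'  > "$tmp/toy.uc"
printf 'H\t0\t250\t98.4\t+\t0\t0\t250M\tread_2\tOTU_1;size=900\n' >> "$tmp/toy.uc"
printf 'H\t2\t250\t97.6\t+\t0\t0\t250M\tread_3\tOTU_3;size=12\n'  >> "$tmp/toy.uc"
printf 'N\t*\t*\t*\t.\t*\t*\t*\tread_4\t*\n'                      >> "$tmp/toy.uc"

# First file: remember borderline labels. Second file: count hits per target.
table=$(awk -F'\t' '
  NR == FNR { if (/^>/) b[substr($0, 2)] = 1; next }
  $1 == "H" { hits[$10]++ }
  END {
    for (t in hits)
      printf "%s\t%d\t%s\n", t, hits[t], ((t in b) ? "borderline" : "nonchimera")
  }' "$tmp/borderline.fasta" "$tmp/toy.uc" | sort)
echo "$table"
```

Each output row lists an OTU label, the number of reads mapped to it, and whether the OTU originated from the borderline set.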