--uchime_denovo doesn't write nonchimeras using abundance information


Emily Van Syoc

Oct 25, 2023, 11:44:10 AM
to VSEARCH Forum
Hello,

Thanks for writing such a great tool. This is my first time using VSEARCH and I believe my issue is just a flag I'm missing somewhere.

When I run the standard pipeline, I am losing >50% of reads as chimeras. Looking closer at the --uchime_denovo output to stdout, I can see that when taking abundance information into account, the number of chimeras is much closer to what I would expect:

"Reading file sorted.fasta 100%
2920080 nt in 12218 seqs, min 32, max 409, avg 239
Sorting by abundance 100%
Counting k-mers 100%
Detecting chimeras 100%
Found 4330 (35.4%) chimeras, 7888 (64.6%) non-chimeras,
and 0 (0.0%) borderline sequences in 12218 unique sequences.
Taking abundance information into account, this corresponds to
33473 (0.8%) chimeras, 4156615 (99.2%) non-chimeras,
and 0 (0.0%) borderline sequences in 4190088 total sequences."


However, the output from --nonchimeras has only the 7888 (64.6%) non-chimeras, and I can't figure out a way to only remove the chimeras detected when taking abundance information into account (should result in 99.2% of input sequences?). 

Thanks! -Emily

Colin J Brislawn

Oct 25, 2023, 11:34:02 PM
to VSEARCH Forum
Good afternoon,

Have you already run the remaining steps in the pipeline?
vsearch --usearch_global all.fasta --db otus.fasta

That will map all your reads to your OTUs, but only to the 7888 OTUs (64.6%) that are non-chimeric.

When you take the sum of the counts inside otutab.txt, do you get about 4156615 reads (99.2-ish percent), or less than half, as you mentioned?

Colin
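
The check Colin suggests can be sketched in a few lines of Python. This is only an illustration, assuming otutab.txt is a tab-separated table with one header line, OTU labels in the first column, and one integer count column per sample (the layout vsearch's --otutabout option is expected to write). The function name total_reads and the example table are hypothetical.

```python
import csv
import io

def total_reads(otutab_text):
    """Sum every count in a tab-separated OTU table (rows = OTUs, columns = samples)."""
    rows = csv.reader(io.StringIO(otutab_text), delimiter="\t")
    next(rows)  # skip the header line ("#OTU ID" followed by sample names)
    return sum(int(count) for row in rows if row for count in row[1:])

# Hypothetical two-sample table, made up for illustration only.
example = "#OTU ID\tsampleA\tsampleB\nparentA\t30\t20\nparentB\t25\t24\n"
print(total_reads(example))
```

For a real table, the file contents would be passed in instead, for example total_reads(open("otutab.txt").read()).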

Emily Van Syoc

Oct 27, 2023, 8:50:34 AM
to VSEARCH Forum
Hello Colin,

Yes, I am running --usearch_global to map the nonchimeric OTUs to the samples to get a feature table. The total read count in the OTU table is 4,580,265, which is interesting since this doesn’t match the nonchimeric output in my first email (I would expect 4156615 reads). I think I may be misunderstanding how the uchime_denovo command is outputting those stats. It’s detecting 35% of OTUs as chimeras, but only removing 0.8% of the reads? I’m not sure I understand how that is working - are those 35% chimeric OTUs present in such low abundance that it’s only affecting <1% of the total reads in the dataset?

Do you have any insight into why so many of the OTUs are detected as chimeric? It seems unusual to lose 35% of the OTUs.

Colin J Brislawn

Oct 27, 2023, 10:41:51 AM
to VSEARCH Forum
>I’m not sure I understand how that is working - are those 35% chimeric OTUs present in such low abundance that it’s only affecting <1% of the total reads in the dataset?
That's correct!

>While 35.4% of the OTUs are being flagged as chimeras here, that accounts for a small number of total reads.
This makes sense, as these data sets are heavily skewed towards the most common features: most reads come from the most common features. Think of a natural log distribution.

>It seems unusual to lose 35% of the OTUs.
This seems pretty normal to me. How many PCR cycles were run? Have you run positive controls with a known composition?
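
The skew described above can be checked with the figures from the first post. The sketch below only divides the reported read counts by the reported numbers of unique sequences; the variable names are illustrative, and nothing here is computed by vsearch itself.

```python
# Figures copied from the --uchime_denovo summary in the first post.
chimeric_uniques, chimeric_reads = 4330, 33473
nonchimeric_uniques, nonchimeric_reads = 7888, 4156615

# Average number of reads behind each unique sequence in the two groups.
mean_chimeric = chimeric_reads / chimeric_uniques
mean_nonchimeric = nonchimeric_reads / nonchimeric_uniques

print(f"mean reads per chimeric unique sequence: {mean_chimeric:.1f}")
print(f"mean reads per non-chimeric unique sequence: {mean_nonchimeric:.1f}")
```

On these numbers, a chimeric unique sequence represents about 8 reads on average, while a non-chimeric one represents about 527, which is how 35.4% of the unique sequences can account for only 0.8% of the reads.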

Frédéric Mahé

Oct 30, 2023, 7:48:28 AM
to VSEARCH Forum
> are those 35% chimeric OTUs present in such low abundance that it’s only affecting <1% of the total reads in the dataset?

Yes, that's correct.

I agree with Colin, tagging 35% of the OTUs as chimeric is not unusual. I don't have stats, but I often see values ranging from 20 to 75%. Taking abundance information into account, this usually corresponds to a few percent of the reads.

Here is a thought experiment with three unique sequences (or OTU representatives if you want), representing 100 reads in total: "parentA;size=50", "parentB;size=49", "chimeraAB;size=1"

Chimera detection stats will look like this:

"Found 1 (33.3%) chimeras, 2 (66.7%) non-chimeras,
and 0 (0.0%) borderline sequences in 3 unique sequences.
Taking abundance information into account, this corresponds to
1 (1.0%) chimeras, 99 (99.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 100 total sequences."
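
The two percentages in the thought experiment can be reproduced with a small Python sketch. It assumes the ";size=N" abundance annotation shown in the labels above and takes the list of chimeric labels as given by hand; it does not detect chimeras, and the function name chimera_stats is hypothetical rather than anything provided by vsearch.

```python
import re

def chimera_stats(labels, chimeric):
    """Return (percent of unique sequences, percent of reads) flagged as chimeric."""
    sizes = {}
    for label in labels:
        # Read the abundance from ";size=N"; treat a missing annotation as 1 read.
        match = re.search(r";size=(\d+)", label)
        sizes[label] = int(match.group(1)) if match else 1
    flagged_reads = sum(sizes[label] for label in chimeric)
    pct_unique = 100 * len(chimeric) / len(labels)
    pct_reads = 100 * flagged_reads / sum(sizes.values())
    return pct_unique, pct_reads

labels = ["parentA;size=50", "parentB;size=49", "chimeraAB;size=1"]
pct_unique, pct_reads = chimera_stats(labels, chimeric=["chimeraAB;size=1"])
print(f"{pct_unique:.1f}% of unique sequences, {pct_reads:.1f}% of reads")
```

The first figure counts each unique sequence once, while the second weights each sequence by its size annotation, so the same single chimera is 33.3% of the unique sequences but 1.0% of the reads.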

Frédéric Mahé

Oct 30, 2023, 11:29:08 AM
to VSEARCH Forum
Here is a toy example replicating the thought experiment above:

https://github.com/torognes/vsearch/issues/537

Emily Van Syoc

Oct 31, 2023, 10:52:17 AM
to VSEARCH Forum
Great, thank you! 