Recommendations for clustering ONT reads with variable length?

Ashley Paulsen

Nov 1, 2024, 3:50:59 AM
to VSEARCH Forum
I'm trying to cluster full-length 16S ONT reads and I'm hitting a stumbling block at the clustering step. I've already dereplicated them, but length variation leaves a lot of singletons that I don't want to lose. I was able to run --cluster_unoise with a minsize of 2, but that discards all the singletons, so a lot of reads end up not being assigned to OTUs.

I also tried --cluster_unoise with a minsize of 1, and separately --cluster_size, but both jobs timed out after 24 hours. I just resubmitted them with a 7-day limit, so while I'm waiting I figured I would see if anyone had any suggestions.

Colin J Brislawn

Nov 1, 2024, 7:54:03 AM
to VSEARCH Forum
I suspect that technical differences between ONT and Illumina are breaking the expectations of the denoising pipeline.

Illumina reads are always the same length (unless the run fails!). --derep_fulllength and --cluster_unoise are built around this assumption.

As you have discovered, when you switch from short-and-accurate reads to long-but-noisy reads, the old methods just don't work.
--derep_fulllength does not make sense for variable-length reads, and --derep_prefix is a different algorithm that is lossy: it throws away quality scores and the base pairs themselves!

In untargeted genomics, a similar change in tools and algorithms took place when ONT was first introduced.
bwa works great for Illumina but not for Nanopore! And thus:
>Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment

While you wait for those jobs to run, may I suggest a lit review? There has got to be a better way!

Colin


Ashley Paulsen

Nov 3, 2024, 5:23:06 AM
to VSEARCH Forum
Sadly, most of the literature on full-length 16S with ONT doesn't include any clustering; most people run Emu or sintax on their reads and call it a day.

There are very few clustering pipelines for ONT reads and most of them actually use vsearch for their clustering (a lot of them run it through Qiime, which I am actively avoiding). None of them really listed out any parameter optimization, especially the ones that used it through Qiime. I've found two other clustering programs but their documentation is lacking and I haven't had any luck even getting them to run.

Colin J Brislawn

Nov 3, 2024, 7:12:37 AM
to VSEARCH Forum
Thanks for sharing your lit review! I was hoping someone had popularized a nanopore pipeline since I last looked, but here we are.

I suppose skipping clustering and going directly to taxonomy with a nanopore aligner is an elegant solution. It does not give an abundance table though... unless only hits to the database are counted.

I don't think Qiime2 includes first-party support for Nanopore... Are you thinking of MetONTIIME (https://github.com/MaestSi/MetONTIIME)?

I'm sorry I'm not much help here.

>which I am actively avoiding
Strangely, I've been on both sides of this! (I've contributed to this qiime-free pipeline and the q2-vsearch plugin. 🤷) Why is qiime2 not right for you?

torognes

Jan 31, 2025, 3:44:59 AM
to VSEARCH Forum
Hi

I have a few recommendations regarding Nanopore reads.

The first is to lower the gap open penalty during alignment by using the following option to vsearch: --gapopen 4I/2E

This will reduce the gap open penalty for internal gaps from 20 (the default) to 4. Terminal gaps already have a much lower penalty (2), and that should stay low.

This makes the alignment scores and gap penalties similar to what minimap2 uses for aligning nanopore reads, and it seems to work well with some data I've tested it on.

Dereplication does not seem to make sense with nanopore reads due to the amount of errors, so I would recommend starting with clustering. The amount of errors seems to depend on the chemistry used and on the base calling accuracy. I experienced an average identity of around 97-98% between sequences that should have been identical, so a clustering with an id level of around 97 percent could perhaps be useful before taxonomic assignment.

I haven't done any rigorous testing of this, so please take the advice with caution.

- Torbjørn
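
As a rough sketch of the suggestions above, a clustering run on non-dereplicated reads might look like the following. The file names and the cluster_reads helper are hypothetical placeholders, and the choice of --cluster_fast (which processes reads by decreasing length) over --cluster_size is an assumption, made because every read has an abundance of 1 when dereplication is skipped. Only the --id 0.97 and --gapopen 4I/2E values come from the thread, and they are untested here.

```shell
#!/bin/sh
# Sketch only: the file names below are hypothetical placeholders.
set -eu

reads="reads.fasta"          # input ONT reads, not dereplicated
centroids="centroids.fasta"  # output: one representative sequence per cluster
uc="clusters.uc"             # output: which read was assigned to which cluster

cluster_reads() {
  # $1 = input reads, $2 = centroid output, $3 = .uc output
  # --cluster_fast is an assumed choice (length-sorted clustering);
  # --id 0.97 and --gapopen 4I/2E are the values suggested in the thread.
  vsearch --cluster_fast "$1" \
    --id 0.97 \
    --gapopen 4I/2E \
    --centroids "$2" \
    --uc "$3"
}

# Run only when vsearch and the input file are actually present.
if command -v vsearch >/dev/null 2>&1 && [ -s "$reads" ]; then
  cluster_reads "$reads" "$centroids" "$uc"
else
  echo "vsearch or $reads not found; nothing was run" >&2
fi
```

If the reads carry abundance annotations, swapping --cluster_fast for --cluster_size should work with the same remaining options.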