Regarding UCLUST (cluster formation)

Kruttika Phalnikar

Jun 15, 2017, 11:24:46 PM

to Qiime 1 Forum

Hi all,

Thank you for reading this post!

I have a naive question about how OTU clustering works.

After joining the paired ends and passing the reads through split_libraries_fastq.py, one gets a histogram and a log file.
The histogram shows the distribution of read lengths and the corresponding number of sequences, generally with a large peak at the median length.
Also, in many publications people mention read length cut-offs such as >200 bp, 200-400 bp, or ~420 bp, applied before they start picking OTUs.

When we pick OTUs using uclust with this distribution of read lengths, will every cluster contain reads of different lengths? I am a bit confused about what 97% similarity within a cluster means when the reads have different lengths.

Could someone please explain?

Thank you
Kruttika

justink

Jun 16, 2017, 1:05:42 AM

to Qiime 1 Forum

I've asked for some help from someone with perhaps more expertise than I have, but in the interim...

I've found that joining paired ends sometimes makes mistakes. If E. coli is (say) 300 nt between your primers, and you join two paired ends and get something 150 nt or 500 nt long, it's probably not a cool new 16S: it's probably a mistake in joining the paired ends. So, I'm a fan of some rough length cutoffs.
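To make the rough-cutoff idea concrete, here is a minimal sketch in plain Python. It is not a QIIME script: `parse_fasta` and `filter_by_length` are hypothetical helper names, and the 250 and 350 nt bounds are made-up values placed around the 300 nt example above, not recommended settings.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len, max_len):
    """Keep only records whose sequence length is within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy joined reads: one too short, one near the expected amplicon size,
# one too long (lengths chosen only for illustration).
example = "\n".join([
    ">read_short", "A" * 150,
    ">read_expected", "A" * 300,
    ">read_long", "A" * 500,
])

# Assumed bounds around an expected ~300 nt amplicon.
kept = filter_by_length(parse_fasta(example), min_len=250, max_len=350)
kept_ids = [header for header, _ in kept]
print(kept_ids)
```

With these toy reads, only the 300 nt read passes; the 150 nt and 500 nt reads are dropped as likely joining errors.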

Then, OTU picking will match each sequence against another sequence and roughly estimate the percent similarity over the length of the shorter sequence. Reference seqs are often much longer than our Illumina amplicons. So each sequence in an OTU will be at least 97% similar to the reference seq (or to the 'seed' sequence if it's a de novo OTU), but not necessarily that similar to every other sequence in that OTU. At least I'm about 80% sure of that last sentence.
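Here is a toy illustration of that last point. It is a deliberate simplification and not how uclust actually aligns or scores sequences: `identity` is a hypothetical helper that assumes both sequences start at the same position, allows no gaps, and counts matches over the length of the shorter sequence. The seed and reads are invented.

```python
def identity(a, b):
    """Ungapped fraction of matching positions over the shorter sequence.

    Simplifying assumption: both sequences start at the same position.
    """
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


seed = "ACGT" * 25                          # invented 100 nt seed sequence
read_a = mutate(seed, [5, 20])              # 100 nt read, 2 mismatches vs seed
read_b = mutate(seed, [10, 30])[:80]        # shorter 80 nt read, 2 mismatches

seed_vs_a = identity(seed, read_a)          # 98 / 100
seed_vs_b = identity(seed, read_b)          # 78 / 80, over the shorter read
a_vs_b = identity(read_a, read_b)           # 76 / 80, mismatches add up

print(seed_vs_a, seed_vs_b, a_vs_b)
```

Both reads clear a 97% threshold against the seed, even though they differ in length, yet they are only 95% identical to each other because their mismatches fall at different positions.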