Alignment of 16S rRNA v4-v5 sequences

Natasha Arora

Jun 22, 2017, 6:22:05 AM

to Qiime 1 Forum

Dear Qiime Team,

We have been examining bacterial sequences from the v4-v5 region of the 16S rRNA gene for a number of human body sites. Our data set is small (ca. 350 OTUs after OTU picking and taxonomic assignment using the RDP classifier). We would like to obtain a phylogenetic tree to calculate PD, among other diversity estimates. However, I am uncertain as to how best to perform the multiple sequence alignment. Why are alignments typically done against a reference database such as Greengenes or SILVA rather than de novo with only our own datasets? The alignment we did against the full-length sequences of Greengenes (using PyNAST) had some positions that looked rather misaligned on visual inspection.

And if it is better to do MSA against a reference database, would it make an important difference to extract the v4-v5 region from the sequences of the reference database (as opposed to using full length sequences)?

Thank you very much for your help,

Natasha

justink

Jun 22, 2017, 5:33:39 PM

to Qiime 1 Forum

Hmm, I'll take a shot at those questions, but they're not trivial.

A major reason people use reference databases is speed: a de novo MSA is typically O(n^2) (where n is the number of sequences you're aligning), but with a reference it's O(n). With only ~350 OTUs, that doesn't really apply here.
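To put rough numbers on that scaling argument, here is a back-of-the-envelope sketch in Python. It assumes a de novo aligner that starts from all-vs-all pairwise comparisons and a reference-based aligner that does one alignment per query; real tools use heuristics, so treat the counts as orders of magnitude only. The function names are made up for illustration.

```python
def de_novo_pairwise_comparisons(n):
    """Number of unordered sequence pairs in an all-vs-all comparison."""
    return n * (n - 1) // 2


def reference_alignments(n):
    """One query-vs-template alignment per sequence."""
    return n


# 352 is the number of sequences mentioned in this thread;
# 100_000 is an arbitrary "large study" size for contrast.
for n in (352, 100_000):
    print(n, de_novo_pairwise_comparisons(n), reference_alignments(n))
```

At a few hundred sequences both counts are small, so speed alone is not a reason to prefer a reference here.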

Another reason is the information you get from existing alignments. Things like knowledge of conserved vs. hypervariable regions are useful when inferring phylogeny (see e.g. here). I'm suspicious of de novo alignments of 16S, because they sometimes align the hypervariable regions at the expense of the conserved regions. That's not likely to reflect evolution.

I doubt extracting the v4-v5 region would make much difference, unless you have reads somehow aligning to other parts of the 16S—that'd be real bad.
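If you want to test it anyway, a minimal sketch of restricting an aligned reference to a window of columns might look like the following. It assumes the alignment is already loaded as a dictionary of equal-length gapped strings; the column coordinates and sequences are toy values, not the real v4-v5 coordinates in any database, and `trim_columns` is a hypothetical helper rather than part of QIIME or PyNAST.

```python
GAP_CHARS = "-."


def trim_columns(alignment, start, end):
    """Keep columns start..end (0-based, end exclusive), then drop
    columns that are gaps in every sequence of the trimmed window."""
    sliced = {name: seq[start:end] for name, seq in alignment.items()}
    if not sliced:
        return {}
    width = min(len(seq) for seq in sliced.values())
    keep = [
        i for i in range(width)
        if any(seq[i] not in GAP_CHARS for seq in sliced.values())
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in sliced.items()}


# Toy alignment; a real one would be read from an aligned FASTA file.
toy_reference = {
    "ref1": "AC-GTA-CGT",
    "ref2": "ACTGT--CGA",
}
trimmed = trim_columns(toy_reference, 2, 8)
for name, seq in trimmed.items():
    print(name, seq)
```

Dropping the all-gap columns matters because a window cut out of a full-length alignment usually carries gaps that only made sense for sequences outside the window.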

All that said, if your alignments to greengenes look dubious, I wonder if the de novo alignment would work better. How's that for noncommittal?

TonyWalters

Jun 23, 2017, 1:30:39 AM

to Qiime 1 Forum

Hello Natasha,

One issue to note with using the Greengenes reference alignment is that it has some positions filtered (SSUalign), so you might get some strangeness depending upon how you did it (see https://github.com/biocore/qiime-default-reference/issues/14).

You can get scenarios where the query sequences are sufficiently divergent from the reference template alignment that they (1) get largely shoved into a gapped position or (2) get put into another region of the SSU gene altogether. These are usually a very small fraction of the reads, though. Your suggestion of creating a template reference alignment of the target region could address some of this issue, if that's what you're seeing a lot of. But since you don't have a huge number of sequences, doing an MSA de novo and visualizing/curating it in ARB (or whichever tool you prefer) may be a quicker route.
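One rough way to spot the second case before opening the alignment in a viewer is to compare where each sequence sits against where most sequences sit. The sketch below is only an illustration under simple assumptions: the alignment is a dictionary of equal-length gapped strings, the "expected" window is the median start and end column, and the 0.5 overlap cutoff is arbitrary. `flag_suspect_sequences` is a hypothetical helper, not a QIIME or PyNAST function.

```python
from statistics import median

GAP_CHARS = "-."


def aligned_span(seq):
    """Return (start, end) columns covered by non-gap characters (end exclusive)."""
    cols = [i for i, c in enumerate(seq) if c not in GAP_CHARS]
    return (cols[0], cols[-1] + 1) if cols else (0, 0)


def flag_suspect_sequences(alignment, min_overlap=0.5):
    """List sequences whose span overlaps the median window by less than min_overlap."""
    spans = {name: aligned_span(seq) for name, seq in alignment.items()}
    win_start = median(start for start, _ in spans.values())
    win_end = median(end for _, end in spans.values())
    window = max(win_end - win_start, 1)
    suspects = []
    for name, (start, end) in spans.items():
        overlap = max(0, min(end, win_end) - max(start, win_start))
        if overlap / window < min_overlap:
            suspects.append(name)
    return suspects


# Toy alignment: otu4 has been placed in a different region from the rest.
toy_alignment = {
    "otu1": "----ACGTACGT--------",
    "otu2": "----ACGTAC-T--------",
    "otu3": "-----CGTACGT--------",
    "otu4": "--------------ACGTAC",
}
suspects = flag_suspect_sequences(toy_alignment)
print(suspects)
```

Anything flagged this way still needs a visual check; the point is only to shorten the list of sequences worth looking at.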

-Tony

Natasha Arora

Jun 29, 2017, 7:15:30 AM

to Qiime 1 Forum

Dear Tony and Justin,

Thanks a lot for your replies, and sorry for my delay in getting back to you. We found out that there may have been another reason why the phylogenetic tree did not make much sense (to us), beyond issues in the multiple sequence alignment. We re-did the taxonomic assignment using SILVA 128 99 nR, to compare with our results using the RDP classifier on the 16S RDP training set. Twelve of the taxa classified as Firmicutes with the RDP training set were now Tenericutes, Cyanobacteria and Proteobacteria (with SILVA). I have to double check, but this might mean that Firmicutes is monophyletic after all...
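A comparison like this can be sketched in a few lines of Python. The example assumes the phylum-level label from each classifier has already been parsed into a dictionary keyed by OTU ID; the OTU IDs and labels below are invented for illustration and are not data from this thread, and `phylum_changes` is a hypothetical helper.

```python
from collections import Counter


def phylum_changes(assignments_a, assignments_b):
    """Return {otu_id: (phylum_a, phylum_b)} for OTUs whose phylum differs."""
    changes = {}
    for otu, phylum_a in assignments_a.items():
        phylum_b = assignments_b.get(otu)
        # Compare case-insensitively so "firmicutes" and "Firmicutes" match.
        if phylum_b is not None and phylum_b.lower() != phylum_a.lower():
            changes[otu] = (phylum_a, phylum_b)
    return changes


# Invented toy assignments from two classifiers.
rdp = {
    "OTU_1": "Firmicutes",
    "OTU_2": "Firmicutes",
    "OTU_3": "Proteobacteria",
    "OTU_4": "Firmicutes",
}
silva = {
    "OTU_1": "Firmicutes",
    "OTU_2": "Tenericutes",
    "OTU_3": "Proteobacteria",
    "OTU_4": "Cyanobacteria",
}

changes = phylum_changes(rdp, silva)
summary = Counter(changes.values())
for (before, after), count in sorted(summary.items()):
    print(f"{before} -> {after}: {count}")
```

Listing the reassigned OTUs by name makes it easier to check whether they are the same ones that sit in odd places in the tree.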

But now, back to the alignment of the 16S: we did it de novo (using Clustal Omega) on our 352 sequences, and except for two regions (where I am not confident about the gaps introduced), the alignment looks pretty decent. Nonetheless, we will compare the results with those from PyNAST, using as the reference database a subsample of SILVA restricted to the phyla of interest (so only the taxa we found through the taxonomic assignment). I can update you once we have the results on this.
These results make me realize that we cannot emphasize enough how important it is to check taxonomic assignments and multiple sequence alignments before any downstream analyses. Certainly not trivial tasks.

Thank you both very much. If you have any further input, we are always grateful for feedback :-)
Natasha
