Alignment issues


Joanito Liberti

Aug 15, 2014, 8:13:36 PM
to oligo...@googlegroups.com
Dear Murat,

I've followed all your suggestions to prepare the FASTA file for oligotyping, but I have a doubt. After I run o-trim-uninformative-columns-from-alignment, I see that my sequences still do not end at the same position, so when I run the entropy analysis I get many entropy peaks at the end of the alignment. You say that all the sequences should have the same length, so what do you suggest I do? Should I keep them like this and move on with the analysis, or should I do something to trim them all to the same length?
It's a 454-generated dataset, and I've now pulled out the genus of interest. I see that most of the sequences are between 406 and 430 bp, but a few are only 300-something bp. Should I maybe trim everything down to 406 bp and remove sequences shorter than that? If so, how can I do this without going through every sequence, since I have 60,000 sequences in this dataset?

Sorry for so many questions!

Thanks a lot for your help!

Best,
Joanito

A. Murat Eren

Aug 16, 2014, 2:09:13 PM
to oligo...@googlegroups.com
Hi Joanito,

There are two things. They sound similar, but they are very different from each other:

Each entry in the FASTA file should have the same number of characters. Yes, but when would the lengths of entries in a FASTA file differ?

There are two possibilities:

  • A quality filtering step may have trimmed each read at a different location, for instance wherever the quality dropped below a certain threshold for that read. So you would get reads like this:

    xxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxx

    However, these reads may all come from one organism; the quality filtering method may have introduced the length variation across them artificially. So in this case the length variation has nothing to do with biology. One could align them just to make the oligotyping pipeline stop complaining that not all reads have an equal number of characters. They would then look like this:

    xxxxxxxxxxxxxxxxxxxxx----
    xxxxxxxxxxxxxxxxxx-------
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxx----------
    xxxxxxxxxxxxxxxxxxxxx----

    And the entropy of these reads would contain a lot of peaks at the end. That is not what we want. In this scenario, what needs to be done is to trim them to the shortest read, or better yet, trim them to a reasonable size and remove the very short reads from the dataset, to make sure they are all trimmed to an evolutionarily somewhat meaningful position. So these reads, after removing the fourth one, would look like this, for instance:

    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
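
The "entropy peaks at the end" effect is easy to see with a toy per-column Shannon entropy computation over made-up reads (illustrative Python only; this is not the entropy code the oligotyping pipeline actually uses):

```python
import math
from collections import Counter

def column_entropies(reads):
    """Shannon entropy (bits) for each column of equal-length reads."""
    entropies = []
    for column in zip(*reads):
        total = len(column)
        counts = Counter(column)
        h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Scenario 1: artificial length variation, hidden by gap padding...
padded = [
    "AAAAATTTTTGGGGG----",
    "AAAAATTTTTGGGGGCC--",
    "AAAAATTTTTGGGGGCCAA",
    "AAAAATTTTTGGG------",
]
# ...versus the same reads uniformly trimmed to an informative length.
trimmed = [read[:13] for read in padded]

print(column_entropies(padded))   # entropy peaks in the trailing columns
print(column_entropies(trimmed))  # flat zeros: no artificial variation left
```

The trailing columns of the padded alignment mix nucleotides with gap characters, so their entropy is high even though the reads are biologically identical; after uniform trimming, every column's entropy drops to zero.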

  • Here is a real-world example of how 454 reads that were trimmed at random locations look after alignment with PyNAST:

    [attached image not shown]

  • The second possibility is that all reads were trimmed to an evolutionarily meaningful position, such as the distal primer, or an anchor. In this case the length variation is biologically relevant, provided the sequencer does not create a lot of indel errors, as 454 does (I will come back to this later). Say you have partially overlapping Illumina reads and you merged and trimmed them. They would look like this (say x, y, and z identify different organisms):

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyyyyyyy
    yyyyyyyyyyyyyyyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzzzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx

    In this case the length variation is biologically relevant, because all these reads start at the proximal primer and end at the distal primer. If this is not 454 data, it is perfectly OK to pad these reads with gap characters to make them equal in length. If you have installed oligotyping, you can simply type o-pad-with-gaps YOUR_FASTA.fa and it will do it for you; the result would look like this:

    xxxxxxxxxxxxxxxxxxxxxxx---
    yyyyyyyyyyyyyyyyyyy-------
    yyyyyyyyyyyyyyyyyyy-------
    xxxxxxxxxxxxxxxxxxxxxxx---
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzzzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx---
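
Conceptually the padding step is tiny: every read is extended with '-' to the length of the longest read. A minimal Python sketch of the idea (an illustration, not the actual o-pad-with-gaps implementation):

```python
def pad_with_gaps(reads):
    """Pad every read with '-' so all reads reach the longest read's length."""
    longest = max(len(read) for read in reads)
    return [read + "-" * (longest - len(read)) for read in reads]

reads = [
    "xxxxxxxxxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyyyyy",
    "zzzzzzzzzzzzzzzzzzzzzzzzzz",
]
for read in pad_with_gaps(reads):
    print(read)
```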

    And it is ready for oligotyping or MED analysis. If this is 454 data, it gets complicated again, because with 454 you have homopolymer-associated indels. So your reads would look like this, even though they all end at an evolutionarily meaningful position (say 'o' identifies a homopolymer indel, like an extra G after 3 G's):

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyyyyyyy
    yyyyyyyyyyyyyyoyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx


    Now you can't just add gap characters at the end, because the length variation is partially artificial. You need to align these reads to remove that artificial part of the length variation. When you use PyNAST against the GreenGenes template, it does something like this:

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyy-yyyyy
    yyyyyyyyyyyyyyoyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzz-zzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx


    And the final form looks like this, without any trimming or padding, due to whichever template they hit:

    xxx----xxxxxxxxxxxxxxxxxxxx
    yy---yyyyyy--yyyyyy-yyy--yy
    yy---yyyyyy--yyyyyyoyyy--yy
    xxx----xxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzz-zzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxx----xxxxxxxxxxxxxxxxxxxx


    So, again, it is ready for oligotyping or MED analysis.

I hope this makes sense.

Here is a recipe for the lazy:

  • "I have Illumina data, all of my reads are equal in length": You don't need to do anything; you can just oligotype / MED-analyze that dataset.
  • "I have Illumina data, the lengths of my reads vary because the quality filtering I used trimmed them at random locations": Remove the 10% shortest reads from your dataset, then trim the remaining reads to the shortest one left in the dataset before the oligotyping / MED analysis.
  • "I have Illumina data, the lengths of my reads vary because I merged partially overlapping reads and trimmed them at the primer location": Pad your reads with gap characters before the oligotyping / MED analysis.

  • "I have 454 data, my reads are not aligned, but they are equal in length": Align them using PyNAST (because someone trimmed them in a meaningless way), then trim from the end of the alignment to get a nice rectangle filled with nucleotides (unlike the image above) before the oligotyping / MED analysis.
  • "I have 454 data, the quality filtering I used trimmed my reads at random locations": Remove the 10% shortest reads from your dataset, trim the remaining reads to the shortest one left in the dataset, align them using PyNAST, then RE-TRIM the aligned reads to get a nice rectangle filled with nucleotides before the oligotyping / MED analysis.
  • "I have 454 data, I trimmed my reads at the primer location": Align your reads using PyNAST, and you are ready to go.
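
The "remove the 10% shortest, trim to the shortest survivor" step in the recipes above can be sketched in a few lines of Python (a toy illustration; the 10% figure is just the rule of thumb stated above, not a hard requirement):

```python
def drop_shortest_and_trim(reads, fraction=0.10):
    """Drop the shortest `fraction` of reads, then trim the rest
    to the length of the shortest surviving read."""
    reads = sorted(reads, key=len)
    survivors = reads[int(len(reads) * fraction):]
    shortest = len(survivors[0])  # sorted, so the first survivor is shortest
    return [read[:shortest] for read in survivors]

# Ten toy reads with lengths 4, 8, ..., 40; the shortest one is dropped
# and the other nine are trimmed to 8 characters each.
reads = ["ACGT" * k for k in range(1, 11)]
print([len(read) for read in drop_shortest_and_trim(reads)])
# [8, 8, 8, 8, 8, 8, 8, 8, 8]
```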

I hope these make sense.


Best,

Joanito Liberti

Aug 17, 2014, 4:35:34 AM
to oligo...@googlegroups.com
Dear Murat, 

thank you very much for your very detailed reply! I now understand much better what is going on with my dataset.

Basically, the sequences I took are the ones right after the trim.seqs step (the one after denoising) in mothur (I took them that high up in the pipeline to maximize the total number of initial sequences for oligotyping). Those sequences have been filtered to remove reads shorter than 200 bp and reads with homopolymers longer than 8 bp; the forward primer was removed, but NOT the reverse one. So when I align them with PyNAST, what happens is that I have 1) variable lengths at the end of the alignment, because the reverse primer sequence is still there, and 2) some sequences that are longer than 200 bp but not long enough to reach the reverse primer...

Now I can see that I have two (or three) options (tell me if I'm wrong):

I could just trim the reverse primer out of this dataset and perhaps filter out reads that are not long enough to reach that position...
Or I could start again by taking the sequences a bit further down in the pipeline, after align.seqs and screen.seqs, so that they have already been screened against the Silva database to remove misaligned sequences and trimmed at both ends at a precise position of the 16S alignment... or I could go even further down and take the sequences after chimera checking...
I see that I might have a tradeoff here: precision and quality of the dataset vs. the total number of sequences I input into the analysis.

What would you suggest I do, then?

Thanks a lot again for your kind help!
Best,
Joanito

A. Murat Eren

Aug 17, 2014, 8:14:31 AM
to oligo...@googlegroups.com
Hi Joanito,

I think the best approach would be to try each of the paths you mentioned and evaluate the gain.

But I would remove the primers first, remove all the reads that didn't make it all the way to the primers, and then align the rest using PyNAST / GreenGenes.
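
That first step, trimming at the reverse primer and discarding reads that never reach it, could look roughly like this in Python. PRIMER below is a hypothetical reverse-primer sequence, not Joanito's actual primer, and a real pipeline would match the primer with mismatch tolerance (e.g. with cutadapt) rather than an exact string search:

```python
PRIMER = "GGACTACCAGGGTATCTAAT"  # hypothetical reverse primer, for illustration

def trim_at_reverse_primer(reads, primer=PRIMER):
    """Trim each read at the reverse primer; discard reads that never reach it."""
    kept = []
    for read in reads:
        pos = read.find(primer)
        if pos != -1:
            kept.append(read[:pos])  # drop the primer and everything after it
        # reads without the primer are discarded entirely
    return kept

reads = ["ACGTACGT" + PRIMER + "AC",  # reaches the primer: kept and trimmed
         "ACGTACGTACGT"]              # never reaches the primer: discarded
print(trim_at_reverse_primer(reads))  # ['ACGTACGT']
```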


Best,

Elisa Ramos Sevillano

Jul 14, 2015, 9:44:36 AM
to oligo...@googlegroups.com
Dear Joanito,

I am starting the analysis with the same Mothur output as you, and I was wondering if you could tell me which approach worked better for you.

thanks a lot
Elisa

picas...@gmail.com

Jun 24, 2016, 11:21:35 AM
to Oligotyping and MED
Hi,

And if my case is:

I trimmed all my reads, then I merged the trimmed reads.

What should I do?