Alignment issues


Joanito Liberti

Aug 15, 2014, 8:13:36 PM
to oligo...@googlegroups.com
Dear Murat,

I've followed all your suggestions to prepare the FASTA file for oligotyping, but I have a doubt. After I run o-trim-uninformative-columns-from-alignment, I see that my sequences still do not end at the same position, so when I run the entropy analysis I get many entropy peaks at the end of the alignment. You say that all the sequences should have the same length, so what do you suggest I do? Should I keep them like this and move on with the analysis, or should I do something to trim them all to the same length?
It's a 454-generated dataset, and I've now pulled out the genus of interest. I see that most of the sequences are between 406 and 430 bp, but a few are only 300-something bp. Should I maybe trim everything down to 406 bp and remove sequences shorter than that? If so, how can I do this without going through every sequence, since I have 60,000 sequences in this dataset?

Sorry for so many questions!

Thanks a lot for your help!

Best,
Joanito

A. Murat Eren

Aug 16, 2014, 2:09:13 PM
to oligo...@googlegroups.com
Hi Joanito,

There are two things. They sound similar, but they are very different from each other:

Each entry in the FASTA file should have the same number of characters. Yes, but when would the lengths of entries in a FASTA file differ?

There are two possibilities:

  • A quality filtering step may have trimmed each read at a different location, for instance wherever the quality dropped below a certain threshold for that read. So you would get reads like this:

    xxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxx

    However, these reads may all come from one organism; the quality filtering method may have introduced the length variation across them artificially. So in this case the length variation has nothing to do with biology. One could align them just to make the oligotyping pipeline stop complaining that not all reads have an equal number of characters. They would then look like this:

    xxxxxxxxxxxxxxxxxxxxx----
    xxxxxxxxxxxxxxxxxx-------
    xxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxx----------
    xxxxxxxxxxxxxxxxxxxxx----

    And the entropy of these reads would contain a lot of peaks at the end. That is not what we want. In this scenario, what needs to be done is to trim them to the shortest read, or better yet, trim them to a reasonable size and remove the very short reads from the dataset, to make sure they are all trimmed to an evolutionarily somewhat meaningful position. So these reads, after removing the fourth one, would look like this, for instance:

    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxx
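
The "entropy peaks at the end" effect is easy to see with a toy per-column Shannon entropy computation over made-up reads (illustrative Python only; this is not the entropy code the oligotyping pipeline actually uses):

```python
import math
from collections import Counter

def column_entropies(reads):
    """Shannon entropy (bits) for each column of equal-length reads."""
    entropies = []
    for column in zip(*reads):
        total = len(column)
        counts = Counter(column)
        h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Scenario 1: artificial length variation, hidden by gap padding...
padded = [
    "AAAAATTTTTGGGGG----",
    "AAAAATTTTTGGGGGCC--",
    "AAAAATTTTTGGGGGCCAA",
    "AAAAATTTTTGGG------",
]
# ...versus the same reads uniformly trimmed to an informative length.
trimmed = [read[:13] for read in padded]

print(column_entropies(padded))   # entropy peaks in the trailing columns
print(column_entropies(trimmed))  # flat zeros: no artificial variation left
```

The trailing columns of the padded alignment mix nucleotides with gap characters, so their entropy is high even though the reads are biologically identical; after uniform trimming, every column's entropy drops to zero.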

  • Here is a real-world example of how 454 reads that were trimmed at random locations look after alignment with PyNAST:

    [attached image not shown]

  • The second possibility is that all reads were trimmed to an evolutionarily meaningful position, such as the distal primer, or an anchor. In this case the length variation is biologically relevant, provided the sequencer does not create a lot of indel errors, as 454 does (I will come back to this later). Say you have partially overlapping Illumina reads and you merged and trimmed them. They would look like this (say x, y, and z identify different organisms):

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyyyyyyy
    yyyyyyyyyyyyyyyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzzzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx

    In this case the length variation is biologically relevant, because all these reads start at the proximal primer and end at the distal primer. If this is not 454 data, it is perfectly OK to pad these reads with gap characters to make them equal in length. If you have installed oligotyping, you can simply type o-pad-with-gaps YOUR_FASTA.fa and it will do it for you; the result would look like this:

    xxxxxxxxxxxxxxxxxxxxxxx---
    yyyyyyyyyyyyyyyyyyy-------
    yyyyyyyyyyyyyyyyyyy-------
    xxxxxxxxxxxxxxxxxxxxxxx---
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzzzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx---
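
Conceptually the padding step is tiny: every read is extended with '-' to the length of the longest read. A minimal Python sketch of the idea (an illustration, not the actual o-pad-with-gaps implementation):

```python
def pad_with_gaps(reads):
    """Pad every read with '-' so all reads reach the longest read's length."""
    longest = max(len(read) for read in reads)
    return [read + "-" * (longest - len(read)) for read in reads]

reads = [
    "xxxxxxxxxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyyyyy",
    "zzzzzzzzzzzzzzzzzzzzzzzzzz",
]
for read in pad_with_gaps(reads):
    print(read)
```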

    And it is ready for oligotyping or MED analysis. If this is 454 data, it gets complicated again, because with 454 you have homopolymer-associated indels. So your reads would look like this, even though they all end at an evolutionarily meaningful position (say 'o' identifies a homopolymer indel, like an extra G after 3 G's):

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyyyyyyy
    yyyyyyyyyyyyyyoyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzzzzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx


    Now you can't just add gap characters at the end, because the length variation is partially artificial. You need to align these reads to remove that artificial part of the length variation. When you use PyNAST against the GreenGenes template, it does something like this:

    xxxxxxxxxxxxxxxxxxxxxxx
    yyyyyyyyyyyyyy-yyyyy
    yyyyyyyyyyyyyyoyyyyy
    xxxxxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzz-zzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxxxxxxxxxxxxxxxxxxxxxx


    And the final form looks like this, without any trimming or padding, due to whichever template they hit:

    xxx----xxxxxxxxxxxxxxxxxxxx
    yy---yyyyyy--yyyyyy-yyy--yy
    yy---yyyyyy--yyyyyyoyyy--yy
    xxx----xxxxxxxxxxxxxxxxxxxx
    zzzzzzzzzzzzzzz-zzzzzzzzzzz
    zzzzzzzzzzzzzzzozzzzzzzzzzz 
    xxx----xxxxxxxxxxxxxxxxxxxx


    So, again, it is ready for oligotyping or MED analysis.

I hope this makes sense.

Here is a recipe for the lazy:

  • "I have Illumina data, all of my reads are equal in length": You don't need to do anything; you can just oligotype / MED-analyze that dataset.
  • "I have Illumina data, the lengths of my reads vary because the quality filtering I used trimmed them at random locations": Remove the 10% shortest reads from your dataset, then trim the remaining reads to the shortest one left in the dataset before the oligotyping / MED analysis.
  • "I have Illumina data, the lengths of my reads vary because I merged partially overlapping reads and trimmed them at the primer location": Pad your reads with gap characters before the oligotyping / MED analysis.

  • "I have 454 data, my reads are not aligned, but they are equal in length": Align them using PyNAST (because someone trimmed them in a meaningless way), then trim from the end of the alignment to get a nice rectangle filled with nucleotides (unlike the image above) before the oligotyping / MED analysis.
  • "I have 454 data, the quality filtering I used trimmed my reads at random locations": Remove the 10% shortest reads from your dataset, trim the remaining reads to the shortest one left in the dataset, align them using PyNAST, then RE-TRIM the aligned reads to get a nice rectangle filled with nucleotides before the oligotyping / MED analysis.
  • "I have 454 data, I trimmed my reads at the primer location": Align your reads using PyNAST, and you are ready to go.
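
The "remove the 10% shortest, trim to the shortest survivor" step in the recipes above can be sketched in a few lines of Python (a toy illustration; the 10% figure is just the rule of thumb stated above, not a hard requirement):

```python
def drop_shortest_and_trim(reads, fraction=0.10):
    """Drop the shortest `fraction` of reads, then trim the rest
    to the length of the shortest surviving read."""
    reads = sorted(reads, key=len)
    survivors = reads[int(len(reads) * fraction):]
    shortest = len(survivors[0])  # sorted, so the first survivor is shortest
    return [read[:shortest] for read in survivors]

# Ten toy reads with lengths 4, 8, ..., 40; the shortest one is dropped
# and the other nine are trimmed to 8 characters each.
reads = ["ACGT" * k for k in range(1, 11)]
print([len(read) for read in drop_shortest_and_trim(reads)])
# [8, 8, 8, 8, 8, 8, 8, 8, 8]
```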

I hope these make sense.


Best,

Joanito Liberti

Aug 17, 2014, 4:35:34 AM
to oligo...@googlegroups.com
Dear Murat, 

thank you very much for your very detailed reply! I now understand much better what is going on with my dataset.

Basically, the sequences I took are the ones right after the trim.seqs step (the one after denoising) in mothur (I took them that high up in the pipeline to maximize the total number of initial sequences for oligotyping). Those sequences have been filtered to remove reads shorter than 200 bp and reads with homopolymers longer than 8 bp; the forward primer was removed, but NOT the reverse one. So when I align them with PyNAST, what happens is that I have 1) variable lengths at the end of the alignment, because the reverse primer sequence is still there, and 2) some sequences that are longer than 200 bp but not long enough to reach the reverse primer...

Now I can see that I have two (or three) options (tell me if I'm wrong):

I could just trim the reverse primer out of this dataset and perhaps filter out reads that are not long enough to reach that position...
Or I could start again by taking the sequences a bit further down in the pipeline, after align.seqs and screen.seqs, so that they have already been screened against the Silva database to remove misaligned sequences and trimmed at both ends at a precise position of the 16S alignment... or I could go even further down and take the sequences after chimera checking...
I see that I might have a tradeoff here: precision and quality of the dataset vs. the total number of sequences I input into the analysis.

What would you suggest I do, then?

Thanks a lot again for your kind help!
Best,
Joanito

A. Murat Eren

Aug 17, 2014, 8:14:31 AM
to oligo...@googlegroups.com
Hi Joanito,

I think the best approach would be to try each of the paths you mentioned and evaluate the gain.

But I would remove the primers first, remove all the reads that didn't make it all the way to the primers, and then align the rest using PyNAST / GreenGenes.
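
That first step, trimming at the reverse primer and discarding reads that never reach it, could look roughly like this in Python. PRIMER below is a hypothetical reverse-primer sequence, not Joanito's actual primer, and a real pipeline would match the primer with mismatch tolerance (e.g. with cutadapt) rather than an exact string search:

```python
PRIMER = "GGACTACCAGGGTATCTAAT"  # hypothetical reverse primer, for illustration

def trim_at_reverse_primer(reads, primer=PRIMER):
    """Trim each read at the reverse primer; discard reads that never reach it."""
    kept = []
    for read in reads:
        pos = read.find(primer)
        if pos != -1:
            kept.append(read[:pos])  # drop the primer and everything after it
        # reads without the primer are discarded entirely
    return kept

reads = ["ACGTACGT" + PRIMER + "AC",  # reaches the primer: kept and trimmed
         "ACGTACGTACGT"]              # never reaches the primer: discarded
print(trim_at_reverse_primer(reads))  # ['ACGTACGT']
```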


Best,

Elisa Ramos Sevillano

Jul 14, 2015, 9:44:36 AM
to oligo...@googlegroups.com
Dear Joanito,

I am starting the analysis with the same Mothur output as you, and I was wondering if you could tell me which approach worked better for you.

thanks a lot
Elisa

picas...@gmail.com

Jun 24, 2016, 11:21:35 AM
to Oligotyping and MED
Hi,

And if my case is:

I trimmed all my reads, then I merged the trimmed reads.

What should I do?