Aligning to self?

Hari

Jul 17, 2014, 2:42:36 AM

to last-...@googlegroups.com

Hi,

Is LAST capable of aligning a multi-FASTA sequence to itself?

Also, how do you suggest I run a self-alignment (any specific -m parameters or seeds)? I've created a -uMAM8 -c database because I'm trying to align the genome to itself to pick out regions of duplication, inversions, and indels.

Thanks!

Hari

Martin Frith

Jul 17, 2014, 4:04:38 AM

to Hari, last-...@googlegroups.com

Hello Hari,

> Is LAST capable of aligning a multi-FASTA sequence to itself?

Yes.

> Also, how do you suggest I run a self-alignment (any specific -m parameters or seeds)? I've created a -uMAM8 -c database because I'm trying to align the genome to itself to pick out regions of duplication, inversions, and indels.

It depends on whether you're mainly interested in recent duplications (which have high similarity) or ancient duplications (which have low similarity). I'd suggest focusing on recent duplications, which should be easier. You could do that using parameters similar to those in the human-chimp example (http://last.cbrc.jp/doc/last-tutorial.html):

lastdb -c -m1111110 mygenome mygenome.fa
lastal -q3 -e35 -f0 mygenome mygenome.fa > out.tab

(The human-chimp example has a lower score threshold, 30 instead of 35, because it then filters the alignments through last-split.)

I hope that helps,

have a nice day,

Martin Frith

http://www.cbrc.jp/~martin/

Hari

Jul 17, 2014, 10:34:46 AM

to last-...@googlegroups.com, ranje...@gmail.com

Hi Martin,

I had in fact tried the specified multiplicity option, but fell back to -uMAM8 because my previous run, on a workstation with 60 GB of memory, maxed out. Is this typical for a 2 Gbp genome? I'm going to move it to a 120 GB machine. Any estimates of the run duration on a single CPU?

By the way, your clarification helped. How would the approach differ for ancient duplications? Could you suggest a method?

Thank you.

Hari

Martin Frith

Jul 19, 2014, 7:44:58 AM

to Hari, last-...@googlegroups.com

Hi Hari,

If you're running out of memory, then I guess that's caused by repeat sequences (LINEs, SINEs, etc.). You can avoid that by masking them (i.e. converting them to lowercase) before alignment. There are various repeat-masking tools, e.g. WindowMasker.

Run duration: I guess less than a day if repeats are masked; otherwise it could be much longer.

For ancient duplications, you could try something like this:

lastdb -c -uMAM8 mygenome mygenome.fa
lastal -e40 -m100 -f0 mygenome mygenome.fa > out.tab

Compared to the recent-duplications recipe, this will use more memory, be slower, and produce much larger output.

Have a nice day,

Martin
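
Since masking here means converting repeats to lowercase, one quick sanity check before building the database is to measure how much of the FASTA file is lowercase. The following is only a minimal sketch: `masked_fraction` is a hypothetical helper name, not part of LAST or WindowMasker, and it assumes plain (uncompressed) FASTA input.

```shell
# masked_fraction FILE
# Print the fraction of sequence letters that are lowercase (soft-masked).
# Hypothetical helper for illustration; not part of LAST.
masked_fraction() {
    awk '
        /^>/ { next }    # skip FASTA header lines
        {
            total += length($0)
            # gsub returns the number of substitutions, i.e. the lowercase count
            lower += gsub(/[abcdefghijklmnopqrstuvwxyz]/, "")
        }
        END {
            if (total > 0) printf "%.4f\n", lower / total
            else print "0.0000"
        }
    ' "$1"
}
```

For example, `masked_fraction mygenome.fa` prints a value between 0 and 1; a value near 0 would suggest that the repeats have not been masked yet.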


Hari

Jul 21, 2014, 4:19:12 AM

to last-...@googlegroups.com, ranje...@gmail.com

The previous run completed and I'm left with a 40 GB tab output. I guess the dotplot program is going to fail on me. Any suggestions? I thought of using awk to keep only the rows with hits to a particular scaffold, but I may miss something important in the visualization.

Hari
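
The awk idea mentioned above could look roughly like this. It is a sketch under the assumption that the file is lastal's tabular (-f0) output, where columns 2 and 7 hold the names of the two aligned sequences; `hits_for_scaffold` and the scaffold name in the usage line are hypothetical.

```shell
# hits_for_scaffold NAME FILE
# Print comment lines plus every alignment row in which either sequence is NAME.
# Assumes tab-separated lastal output with sequence names in columns 2 and 7.
hits_for_scaffold() {
    awk -F'\t' -v name="$1" '
        /^#/ { print; next }        # keep header/comment lines
        $2 == name || $7 == name    # keep rows that touch the scaffold
    ' "$2"
}
```

Usage would be something like `hits_for_scaffold scaffold_12 out.tab > scaffold_12.tab`. Matching on both columns keeps alignments in which the scaffold appears on either side, so its hits to other scaffolds are not lost.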

Martin Frith

Jul 22, 2014, 6:30:52 PM

to Hari, last-...@googlegroups.com

Hello,

That's a huge output! You could try keeping, say, the top 1% of alignments with the highest scores.

Good luck,
Martin
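
One possible way to keep the top-scoring fraction, sketched under the assumption that the file is lastal's tabular (-f0) output with the alignment score in column 1. `top_fraction` is a hypothetical helper, not a LAST tool. It reads the file twice and stores only a count per distinct score, so it does not need to sort or hold the whole 40 GB file.

```shell
# top_fraction FILE FRACTION
# Print comment lines plus the alignments whose score is at or above the
# cutoff that keeps roughly the top FRACTION (e.g. 0.01) of rows.
# Assumes the score is the first whitespace-separated field of each row.
top_fraction() {
    file=$1
    frac=$2
    # Pass 1: count rows per score, then walk scores from high to low
    # until the requested share of rows has been covered.
    cutoff=$(
        awk '!/^#/ && NF { n[$1]++ } END { for (s in n) print s, n[s] }' "$file" |
        sort -k1,1nr |
        awk -v f="$frac" '
            { score[NR] = $1; count[NR] = $2; total += $2 }
            END {
                want = total * f
                for (i = 1; i <= NR; i++) {
                    seen += count[i]
                    if (seen >= want) { print score[i]; exit }
                }
            }'
    )
    # No alignment rows found: nothing to filter.
    [ -n "$cutoff" ] || return 0
    # Pass 2: keep comments and rows scoring at least the cutoff.
    awk -v c="$cutoff" '/^#/ { print; next } NF && $1 + 0 >= c + 0' "$file"
}
```

For example, `top_fraction out.tab 0.01 > top.tab`. Because all rows tied at the cutoff score are kept, the result can be somewhat larger than exactly 1%.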
