lastal crashes when aligning pacbio reads to genome

Pedro Barbosa

unread,

May 8, 2017, 7:31:41 AM5/8/17

to last-align

Hello,

I'm trying to align error corrected pacbio reads (around 5 million) to a Illumina based draft assembly. Firstly, I run lastal in the local alignment mode, but I was getting an extremely high number of blocks aligned. As of 500Gb of lastal.out file size, I stopped the run.

lastal -fTAB -v -P10 -m20 -l5 ../index_draft/draft_coak ec_outCov90_minCov4.correctedReads.fasta

Then, I went for the for the overlap mode, changed the seeding scheme when creating the index and increased minimum length of initial matches.
lastdb -uNEAR -P10 draft_coak_NEAR draft-0.2.fasta
lastal -fTAB -P10 -m10 -l40 -T1 ../index_draft/corkoak_NEARindex ec_outCov90_minCov4.correctedReads.fasta > lastal.out

lastal stops after writing to ouptut 595,843 reads with this error:
bad symbol in sequence: $

I don't really know what does that mean, as I used this same fasta for other analysis without any problem. Do you also agree with the of overlapping mode approach ? Or do you suggest any other parameter settings for pacbio to achieve a great speed/sensibility trade-off ?

Thanks in advance,
Pedro

Frith, Martin

unread,

May 10, 2017, 11:00:18 PM5/10/17

to Pedro Barbosa, last-align

Hi Pedro,

I'm sorry for the slow reply. (Sometimes they make me prepare lectures and stuff.)

My main recommendation is to use last-split, which seeks a unique best alignment for each (part of each) read. That will reduce the output size.

I recommend not using overlap mode (except maybe in some very special situations). In practice it usually gives inferior results. Maybe the documentation should try harder to advise against it.

This is the currently-recommended way to align long reads:
https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md

It's for non-error-corrected reads, so for your case it will be unduly slow-and-sensitive.

I'm not familiar with your data, so the following is guesswork:

I would use the "without repeat-masking" DNA recipe at the above link.

I would add -k50 to the lastal options. Crudely speaking, this makes it 50x faster and 50x less sensitive. But I guess sensitive enough for you. (50 is a total guess: you could experiment with -k20, -k100, etc.)

I might add -C2 to the lastal options. This reduces the slowdown caused by unmasked repeats. But this option is recommended only if the indel rate is not too high.

Your idea to set a minimum match length is good: it has a similar effect to -k50, but the latter tends to be more effective.

As for the bad "$" symbol: your sequences should contain only a, c, g, t (and maybe ambiguous letters like n). Apparently there is a "$".

Have a nice day,

Martin
https://sites.google.com/site/mcfrith/

--
You received this message because you are subscribed to the Google Groups "last-align" group.
To unsubscribe from this group and stop receiving emails from it, send an email to last-align+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pedro Barbosa

unread,

May 16, 2017, 10:38:03 AM5/16/17

to Frith, Martin, last-align

Hello Martin,

Thanks for the suggestions. Indeed, I'm now able to finish in a reasonable amount of time with a decent output file. I needed to set a score threshold -e, because the scoring scheme was changed after I run last-train:

lastal -P8 -k50 -e75 -p coak_train.par ../index_draft/coakdb ec_outCov90_minCov4.correctedReads.fasta | last-split -m1e-6 > out.maf

However, other questions came up to my mind. I measured the number of reads with alignments and it appears to be very low (only 12% of the 5million). Anyway, the fraction of the reference genome covered (in bp) by the aligned reads is good, 75%, and 84% of the reference scaffolds have alignments. The unaligned ones are very short and only represent 3% of the total genome size. If this small set of aligned reads already spans a significant portion of the genome, why do so many reads remained fully unaligned? Is it possible that pacbio dataset holds huge unique regions which the reference lacks, or does the tweaking of lastal parameters might have had influence on this poor alignment rate? I'm running now again with -k25.

I further noticed that lastal splits fasta headers on the first space. The error corrected reads I have in hand were originated from the canu assembly pipeline, which has its own module for error correction. For some reads (in my case ~115k), it creates multiple sequences from the same original subread (I believe on those cases where canu found adaptors) with a slight difference in fasta header: e.g subread_name1 id=1_1; subread_name1 id=1_2. As this space was removed from the header, I believe last performs the alignment and reports the results of both duplicates. Is there a way to find out the alignments matching the original read IDs ?

Thanks,

Pedro

--
You received this message because you are subscribed to a topic in the Google Groups "last-align" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/last-align/QKRk2lQUJpc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to last-align+unsubscribe@googlegroups.com.

Frith, Martin

unread,

May 16, 2017, 9:19:39 PM5/16/17

to Pedro Barbosa, last-align

Hi Pedro,

why did you need to set -e? Is it due to this message...

can't calculate E-values: maybe the mismatch or gap costs are too weak.
To proceed without E-values, set a score threshold with option -e.
?

That would indicate that last-train's output is bad. Could you share the last 11 lines of your last-train output?

Spaces in unique identifiers: that's one of the most troublesome things in bioinformatics. Lots of formats use spaces as separators between fields. I would fix it by replacing the space with a colon or something (not tested):

sed '/>/s/ /:/' nasty.fasta > nice.fasta

(You could pipe this directly into lastal instead of writing "nice.fasta".)
See also: http://last.cbrc.jp/doc/FAQ.html

Have a nice day,

Martin

https://sites.google.com/site/mcfrith/

Pedro Barbosa

unread,

May 17, 2017, 5:54:28 AM5/17/17

to Frith, Martin, last-align

Yes, that was the message. Please find attached the whole training file.

The run with -k20 also finished, and the improvements were minimal.

Thanks,

Pedro

coak_train.par

Frith, Martin

unread,

May 18, 2017, 2:10:07 AM5/18/17

to Pedro Barbosa, last-align

Hi Pedro

I think your last-train output is OK, and instead there is a bug in the e-value routines. Many thanks for uncovering this. I'll report back if and when it gets fixed. In the meantime, setting -e is a good workaround. (But 75 seems a bit low: that will allow alignments as short as 10 bases.)

Why are there so many unaligned reads: I don't know. A conceivable reason is ambiguity: if a read aligns almost equally well to 2 places, it will be discarded. You could try replacing -m1e-6 with -m1, which keeps high-ambiguity alignments. (Each alignment's ambiguity is annotated with a "mismap" probability.)

Have a nice day,

Martin

https://sites.google.com/site/mcfrith/

Pedro Barbosa

unread,

May 18, 2017, 5:19:40 AM5/18/17

to Frith, Martin, last-align

Hello Martin,

If it may help, I can send you some reads from my dataset.

Thanks,

Pedro

Frith, Martin

unread,

Jun 15, 2017, 2:32:15 AM6/15/17

to Pedro Barbosa, last-align

Hello

almost a month later, the E-value problem is fixed in the latest LAST. That is, lastal will calculate E-values for this scoring scheme without complaint.

Have a nice day,

Martin

https://sites.google.com/site/mcfrith/

Reply all

Reply to author

Forward