No alignments generated for long reads (simulated PacBio)

Darya Filippova

unread,

Dec 8, 2015, 5:09:57 PM12/8/15

to rna-star

Hi,

I used a pbsim tool to generate PacBio long reads (avg length = 2411bp) and now I am trying to align these to a viral genome (len=9400bp or so). I went though previous posts related to mapping the long reads and tried various parameters for generate alignments, STARlong completes the run, however, every time all reads are placed into "reads unmapped: other" group (i.e. 100% end up in the group). Here is the command I used:

STARlong --runThreadN 4 --runMode alignReads --genomeDir data/star_genome_index_HXB2 --readFilesIn data/sampled_reads_mix_HXB2.fasta_90pct.fa --winAnchorMultimapNmax 200 --seedPerReadNmax 10000 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNmax 1000 --seedSearchStartLmax 50 --alignTranscriptsPerWindowNmax 100 --seedPerWindowNmax 100 --alignTranscriptsPerReadNmax 100 --alignWindowsPerReadNmax 20000

Any suggestions on what I can do to align these reads?

Thanks

Alexander Dobin

unread,

Dec 9, 2015, 5:08:12 PM12/9/15

to rna-star

Hi Darya,

are you simulating reads with high error rates (similar to raw PacBio reads), or with a low-medium error rates (similar to PacBio CCS reads)?

STAR will not map the former, however, it should work with the latter - in which case please send me the genome fasta and a few reads for testing.

Cheers

Alex

Darya Filippova

unread,

Dec 9, 2015, 7:10:45 PM12/9/15

to rna-star

Alex, great catch -- I was generating subreads instead of CCS by mistake. Now that I am simulating CCS reads, I get 41.97% uniquely mapped, 49.44% "unmapped: other" and 8.36% "unmapped: too short". Does that sound like a reasonable mapping rate (I am used to higher 90ties with Illumina).

Alexander Dobin

unread,

Dec 10, 2015, 4:21:18 PM12/10/15

to rna-star

Hi Darya,

the % if unique mappers seems to be on the low side, for real PacBio CCS reads we typically get >60%, and most of the unmapped reads are "too short" rather than "other".

Of course, this dependes strongly on your simulation parameters, such as the number of consensus passes per read. Typically, reads with only 2 passes do not align well, so if you have a large proportion of 2-passes read, it could drive down the % of uniquely mapped reads.

You could also tweak your parameters a bit to see if this would better alignment rate, my recommendation

reduce --seedSearchStartLmax to 30 or even less, down to 12

this should increase sensitivity of search for reads with higher error rates (i.e. fewer passes)
--seedPerReadNmax 100000 --seedPerWindowNmax 10000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000

not sure if this will have a big effect for a very small genome

Cheers

Alex

Darya Filippova

unread,

Dec 11, 2015, 5:57:35 PM12/11/15

to rna-star

Alex,

I played around w/ simulation parameters where I now allow for less mismatches and am generating longer reads, so % uniquely mapped is now at 88 - 89% for depth=10K dataset and 91-94% for depth=50 dataset. Thanks for your help!

Reply all

Reply to author

Forward