slow mapping to a small genome

Marcin Cieślik

Apr 23, 2014, 4:50:54 PM

to rna-...@googlegroups.com

Hi,

I am using STAR to estimate the amount of rRNA in our RNA-seq data. Because there are many partial (retro-transposed) copies of rRNA in the human genome, I first map to a small genome consisting of the 45S and 5S sequences. Unfortunately, this step is at least 5x slower than mapping to the full genome. I tried adjusting genomeSAindexNbases; the largest value that does not segfault is 6, but the speed improvement is very small (10%). Is there anything that can be done about this?

Possibly one way would be to embed the 45S and 5S sequences in random sequence to reach the needed 14 index bases ;). An alternative would be to include these sequences with the human reference, but then some rRNA reads might align to the retro-transposed copies (or actually will not align anywhere because of too many multi-maps)...
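For reference, the STAR manual gives a rule of thumb for scaling genomeSAindexNbases down for small genomes: min(14, log2(GenomeLength)/2 - 1). A minimal sketch of that calculation follows; the function name and the choice to round down are assumptions made here, not part of STAR.

```python
import math


def suggested_sa_index_nbases(genome_length: int, cap: int = 14) -> int:
    """Rule of thumb: min(cap, log2(genome_length) / 2 - 1).

    Rounding down and the lower bound of 1 are assumptions of this
    sketch, chosen to stay on the conservative side.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    value = math.floor(math.log2(genome_length) / 2 - 1)
    return max(1, min(cap, value))


if __name__ == "__main__":
    # Hypothetical lengths: a 16,384-base reference and a ~3 Gb genome.
    for length in (2 ** 14, 3_000_000_000):
        print(length, suggested_sa_index_nbases(length))
```

For a reference of only a few kilobases this lands well below 14, which is consistent with small values such as 6 being the practical upper limit here.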

Thanks,

Marcin

Alexander Dobin

Apr 25, 2014, 12:14:10 PM

to rna-...@googlegroups.com

Hi Marcin,

The problem is not the smallness of the genome per se, but rather its incompleteness. Since the majority of the reads will not map to the rRNA reference, STAR tries hard to place them, which slows down the mapping.

I think including the rRNA sequences with the human genome is the best option. You are right that it will increase multi-mapping for some reads; however, you can deal with such alignments in post-processing. Most rRNA alignments will be multi-mappers anyway.
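One possible form of such post-processing is to count a read as rRNA if any of its reported alignments, primary or secondary, falls on an rRNA contig. The sketch below parses plain SAM text; the contig names rRNA_45S and rRNA_5S and the function name are hypothetical, and it assumes the multi-mapping alignments are actually present in the SAM output.

```python
def count_rrna_reads(sam_lines, rrna_contigs):
    """Return (reads with >=1 alignment on an rRNA contig, all mapped reads).

    Reads are identified by name, so for paired-end data this counts
    fragments rather than individual mates.
    """
    rrna_reads = set()
    mapped_reads = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, contig = fields[0], int(fields[1]), fields[2]
        if flag & 0x4:
            continue  # SAM flag 0x4: read is unmapped
        mapped_reads.add(name)
        if contig in rrna_contigs:
            rrna_reads.add(name)
    return len(rrna_reads), len(mapped_reads)


if __name__ == "__main__":
    # Tiny made-up example: r1 multi-maps to an rRNA contig and chr1.
    example = [
        "@HD\tVN:1.4",
        "r1\t0\trRNA_45S\t100\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
        "r1\t256\tchr1\t500\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
        "r2\t0\tchr2\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    ]
    print(count_rrna_reads(example, {"rRNA_45S", "rRNA_5S"}))
```

Reads that exceed the multi-mapper limit are reported as unmapped and would not be counted this way.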

Another option is to map to the standard genome first, and then re-map the unmapped reads (output with --outReadsUnmapped Fastx) to the rRNA reference.
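A rough sketch of this two-step approach, written as a Python helper that only builds the two STAR command lines (it does not run them). The directory names, file names, and the helper itself are hypothetical; it assumes single-end, uncompressed FASTQ input, and relies on STAR writing unmapped reads to Unmapped.out.mate1 under the chosen output prefix.

```python
def two_step_commands(genome_dir, rrna_dir, fastq, out_prefix, threads=4):
    """Build STAR commands: map to the full genome, then re-map the
    unmapped reads to the rRNA-only index."""
    first_prefix = f"{out_prefix}genome_"
    second_prefix = f"{out_prefix}rrna_"
    first = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--outFileNamePrefix", first_prefix,
        "--outReadsUnmapped", "Fastx",  # write unmapped reads to a file
    ]
    # With --outReadsUnmapped Fastx, single-end unmapped reads are
    # written to <prefix>Unmapped.out.mate1.
    unmapped = f"{first_prefix}Unmapped.out.mate1"
    second = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", rrna_dir,
        "--readFilesIn", unmapped,
        "--outFileNamePrefix", second_prefix,
    ]
    return first, second


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in two_step_commands("idx/hg19", "idx/rRNA", "sample.fastq", "out/s1_"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths point at real indices and reads.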

Cheers

Alex

Marcin Cieślik

Apr 30, 2014, 7:26:03 AM

to rna-...@googlegroups.com

Dear Alex,

Thanks for the explanation, it makes sense now. Aligning to reference+rRNA does not seem to be a very good solution: there are hundreds of partial rRNA sequences in the reference, so only very unreasonable multi-mapper settings would yield those reads as aligned. It is possible to mask the genomic sequences, but that is again sub-optimal, as they might be located in UTRs.
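For illustration, masking here would mean replacing the genomic rRNA-like intervals with N before building the index. A minimal sketch on an in-memory sequence, assuming 0-based, half-open intervals (BED-style); the function name and the toy sequence are made up.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Return a copy of sequence with each [start, end) interval masked."""
    chars = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        # Replace every base in the interval with the mask character.
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    # Toy example: mask bases 2..4 of a 10-base sequence.
    print(mask_intervals("ACGTACGTAC", [(2, 5)]))
```

Any real transcript bases inside a masked interval, for example in a UTR, become unalignable as well, which is the drawback noted above.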