4 M/hr with 5 threads, a genome less than 3Mb and 30 mill pair end reads. Is that normal?

718 views
Skip to first unread message

Lorena Pantano

unread,
Feb 24, 2014, 9:56:40 AM2/24/14
to rna-...@googlegroups.com
Hi,

I want to use STAR because it seems very fast and with good accuracy. I have human RNAseq, around 30 mill pair reads, and I created a genome reference that is a 3Mb region of chr11.

I run STAR with 5 threads and I am getting a speed of 4 millions per hour. Is that normal? according to the paper, not, but maybe there is something wrong in my case. According to the command "ps" STAR is running all the time and using 5 CPUs.

any idea?

thanks alot

Alexander Dobin

unread,
Feb 24, 2014, 6:52:48 PM2/24/14
to rna-...@googlegroups.com
Hi Lorena,

this is definitely too slow. 
When generating genome files for a small genome, please try to use  --genomeSAindexNbases 10 (usually this causes a seg-fault,see this post).
If this does not resolve the speed issue, please send me your Log.out and Log.final.out files.

Cheers
Alex

Lorena Pantano

unread,
Feb 25, 2014, 7:05:22 AM2/25/14
to rna-...@googlegroups.com
Hi Alexander,

I generated the genome again with that parameter but nothing changed. I created a 4 mill lines fastq file just to avoid waiting for 6 hours. I used 2 threads, and run it in two different machines to remove the possibility of a unique problem, and the speed was 2.5 M/h.

I attach the two Log files.

thanks
Log.final.out
Log.out

Alexander Dobin

unread,
Feb 25, 2014, 4:27:41 PM2/25/14
to rna-...@googlegroups.com
Hi Lorena,

this is an interesting result, I actually I have not tried something like that before.
I suspect this is what is going on. You are mapping into a very small portion of the genome, so - not surprisingly - only a small percentage of reads map, <1%.
STAR wastes a huge amount of time on those 99% of the reads that do not map, breaking these unmappable sequences into very small seeds.
I believe if you try to map to the whole genome, you will see much higher speed (somewhat counter-intuitive), and higher mapping rate, of course. 

What is the reason for mapping into a small portion of the genome? In general, this is not a good idea, since you will be getting false positive alignments that are forced into to map into the restricted reference.

Cheers
Alex

Lorena Pantano

unread,
Feb 26, 2014, 3:31:18 AM2/26/14
to rna-...@googlegroups.com
Hi,

I totally agree with you.

The main reason is that that region is not exactly as the human reference, is a region where there is an inversion.  So I fake that region: put flanking regions as the reference genome, and the inversion in the middle.

My idea is to reconstruct later with cufflink to see if I can detect a new transcript due to the region: overlapping the breakpoints of the inversion, similar to a fusion gene.

So, I guess I would have to fake a reference genome with all inversions and map the 180 samples. I was trying to reduce to the minimum the running time due to the high number of samples.

I will test it, thanks again. If you have any other idea, please let me know

PS: what version is compatible with cufflinks? I tried with 2.3.1z that I got the error non  XS tag in spliced reads?

Lorena Pantano

unread,
Feb 26, 2014, 7:53:55 AM2/26/14
to rna-...@googlegroups.com

forget about XS question. RTFM helps, sorry I didn't find it at first.

Lorena Pantano

unread,
Feb 26, 2014, 10:45:04 AM2/26/14
to rna-...@googlegroups.com
and if you could explain me this parameter:
--seedPerWindowNmax

I get a good speed reducing this to 30 (125M/h). And I get the main reads supporting know genes with that. I just to want me make sure I don't doing a wrong assumption.

thanks!

Alexander Dobin

unread,
Feb 27, 2014, 4:10:32 PM2/27/14
to rna-...@googlegroups.com
Hi Lorena,

This is an interesting obcervation. Briefly, STAR maps a read as follows. 
1). short pieces of a read ("seeds") are mapped to the genome. The length of the seeds is not fixed, instead STAR increases seed length until first mismatch. At that point, a new seed is started, and the procedure is repeated until all of the read sequence is covered by seeds. 
2). Second, STAR defines genomic windows centered around "anchor seeds", which map fewer than --winAnchorMultimapNmax (50) times. The size of the window is determined by --alignIntronMax or --winBinNbits/--winAnchorDistNbins, by default it's ~600kb.
3). STAR collects all seed in each of the windows, and stitches (assembles) an alignment.

--seedPerWindowNmax (=50 by default) defines the maximum number of "seeds" that is allowed in each window. Usually this number is small, since it' basically equal to the total number of gaps, indels and mismatches in an alignment. However, in your case, the 99% of the reads did not originate from the short reference you are providing, which results in a large number of short seeds for each read. The mapping time scales exponentially with the number of seeds per window, so when you reduce --seedPerWindowNmax, the mapping speed increases.
Although this is a solution for the mapping speed problem, it seems a bit of hack, and it does not solve the underlying problem of incomplete reference. 

In my view, the best solution is to create reference by adding your your inversion sequences (with flanking regions) as separate "chromosomes" to the whole genome sequence. 
To avoid multi-mappers that map to both the inversion "chromosome", and the normal chromosomes, you can cut the inversion and flanking sequences from the normal chromosomes, or you can try to deal with these multi-mappers in post-processing.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages