rna-star 2.4.2a run time drastically slow

limeri...@gmail.com

Nov 4, 2015, 10:58:08 AM

to rna-star

Hi Alex,

I have a question about STAR runs that are aligning much more slowly than normal, I think.
Runs take >4 hours on a 96 GB, 24-core machine for paired-end reads with ~150 million mate pairs. I've reinstalled the latest STAR version (2.4.2a) and regenerated the genome.
The only difference from previous, much faster runs is that I'm mapping to only one chromosome, chr5. I've attached a few lines of the progress log below to give an idea of the run speeds.

Any speed up ideas would be greatly appreciated.

Sean.

=============================================================================================

Command calls used :

STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ~/STAR_indices/hg19.chr5 --genomeFastaFiles ~/genomes/hg19.chr5.fa --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --sjdbOverhang 100

STAR --genomeDir ~/STAR_indices/hg19_chr5 --runThreadN 20 --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz --twopassMode Basic --twopass1readsN -1 --outFileNamePrefix STAR_align/read. --outSAMattributes All

##### I had also initially attempted running the job in 2pass mode as follows but had to abandon the run after >24 hours with alignment still in the first pass mapping phase. #####
STAR --genomeDir ~/STAR_indices/hg19_chr5 --runThreadN 20 --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz --twopassMode Basic --twopass1readsN -1 --outFileNamePrefix STAR_align/read. --outSAMattributes All

==============================================================================================

Example Progress log:

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 03 15:15:12      0.2       67204      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:16:12      0.7      268871      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:37:23      0.8      604564      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:38:36      1.1      872971      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:59:35      1.0     1141374      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 16:00:59      1.2     1409380      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
...
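
For scale, the Speed column (million reads per hour) and the Read number column can be turned into a rough estimate of the remaining run time. The helper below is an illustrative sketch, not part of the original post: the function name `star_eta_hours` is hypothetical, and it assumes the read counter in the log counts the same units as the total you pass in (pairs, for paired-end data).

```shell
# Estimate remaining hours from the last data row of a STAR Log.progress.out.
# Usage: star_eta_hours <Log.progress.out> <total_reads>
# (hypothetical helper; assumes the standard progress-log column order)
star_eta_hours() {
  awk -v total="$2" '
    # Data rows look like: "Nov 03 16:00:59  1.2  1409380  400 ..."
    # Field 4 = speed in million reads/hour, field 5 = reads processed so far.
    # Header rows and "Started 1st pass mapping" rows do not match and are skipped.
    $4 ~ /^[0-9.]+$/ && $5 ~ /^[0-9]+$/ { speed = $4; done = $5 }
    END {
      if (speed > 0) printf "%.1f\n", (total - done) / (speed * 1e6)
      else print "NA"   # no usable row, or speed still reported as 0.0
    }' "$1"
}
```

With the last row shown above (1.2 M/hr, 1409380 reads done) and a total of 150 million, `star_eta_hours Log.progress.out 150000000` gives an estimate on the order of 120 hours.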

limeri...@gmail.com

Nov 4, 2015, 12:57:08 PM

to rna-star

Just to update: I reran the same task using the full genome, and progress is vastly improved. I still don't understand why aligning to a single chromosome is so slow.

==> hs_mfg_35do_full_genome.Log.progress.out <==
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 03 19:15:05      1.9       67132      400     0.0%    297.5     1.7%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:16:07     22.0     1141402      400     0.0%    298.9     1.4%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:17:17     29.1     2079319      400     0.0%    299.4     1.3%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:18:21     36.1     3218292      400     0.0%    299.9     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:19:22     37.3     3955169      400     0.0%    299.7     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:20:23     40.3     4960129      400     0.0%    299.6     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:21:33     40.9     5831052      400     0.0%    299.8     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%

==> hs_mfg_35do.Log.progress.out <==
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 02 16:25:31 Started 1st pass mapping
Nov 02 18:37:08      0.0       67132      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:38:19      0.1      268867      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:39:31      0.2      470297      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:40:34      0.3      738567      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:41:45      0.5     1073964      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:43:17      0.6     1275227      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 20:49:01      0.3     1409375      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%

Alexander Dobin

Nov 4, 2015, 1:07:35 PM

to rna-star

Hi Sean,

there is some discussion on this topic in these threads:

https://groups.google.com/d/msg/rna-star/hJL_DUtliCY/G1IOpvgx3H4J

https://groups.google.com/d/msg/rna-star/cLpf7BuDnGY/nLXTE_pHDHgJ

Briefly, if you map to a small portion of the genome, STAR will waste a lot of time trying to find poor quality alignments for the reads that originate outside that small portion.

There are some ways to speed up such alignment, but the best way is to map to the most complete reference.
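
As an illustration of mapping to the complete reference and then keeping only the region of interest (not part of the original reply), here is a minimal sketch. The function name and all paths are hypothetical; it assumes a full-genome STAR index already exists, that the chromosome is named `chr5` as in UCSC hg19, and that samtools is installed. `STAR_BIN` and `SAMTOOLS_BIN` are illustrative overrides for the executables.

```shell
# Sketch only: helper name, paths and thread count are illustrative.
# Usage: map_full_then_keep_chr5 <full_genome_index> <read1.fastq.gz> <read2.fastq.gz> <out_prefix>
map_full_then_keep_chr5() {
  local genome_dir=$1 read1=$2 read2=$3 prefix=$4
  # STAR names coordinate-sorted BAM output <prefix>Aligned.sortedByCoord.out.bam
  local bam="${prefix}Aligned.sortedByCoord.out.bam"

  # 1) Align against the complete genome index, so that reads from other
  #    chromosomes are placed where they belong instead of being searched
  #    for poor-quality alignments on chr5.
  "${STAR_BIN:-STAR}" --genomeDir "$genome_dir" --runThreadN 20 \
    --readFilesCommand zcat --readFilesIn "$read1" "$read2" \
    --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "$prefix" || return 1

  # 2) Index the sorted BAM, then keep only the chr5 alignments.
  "${SAMTOOLS_BIN:-samtools}" index "$bam" || return 1
  "${SAMTOOLS_BIN:-samtools}" view -b -o "${prefix}chr5.bam" "$bam" chr5
}
```

A call might look like `map_full_then_keep_chr5 ~/STAR_indices/hg19 read1.fastq.gz read2.fastq.gz STAR_align/read.`, where the index path is an assumed location.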

Cheers

Alex

limeri...@gmail.com

Nov 4, 2015, 1:13:32 PM

to rna-star

Got it & thanks for the discussion thread links.

Dan

Nov 5, 2015, 5:20:38 AM

to rna-star

Hi!

It would be great if STAR had some kind of command-line option (something like "do not try too hard to align the reads, because it is known beforehand that less than 10% will map") for this kind of situation, where 90% of the reads do not come from the reference genome to which they are aligned!

For example, this is useful when one would use STAR for filtering reads out when using reads from a human sample, as follows.

1) Filtering step => map all the reads against all known virus/bacteria/phage genomes (everything here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ ) <= here probably ~3% of reads map (and this is known a priori)

2) actual mapping on human genome => map on the human genome all the reads which didn't map at step 1

The filtering step has the advantage that one can also see the infection status of a patient (e.g. a patient infected with HIV, hepatitis virus, etc.). This kind of approach does not work very well with STAR because of the slowdown.
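
The two steps above can be sketched with STAR's `--outReadsUnmapped Fastx` option, which writes the reads that did not align to `Unmapped.out.mate1` / `Unmapped.out.mate2`. This is an illustrative sketch rather than a tested pipeline: the function name, index locations and output layout are hypothetical, `STAR_BIN` is an illustrative override for the executable, and the slowdown discussed in this thread still applies to step 1.

```shell
# Sketch only: helper name, directory layout and thread count are illustrative.
# Usage: filter_then_map <filter_index> <human_index> <read1.fastq.gz> <read2.fastq.gz> <outdir>
filter_then_map() {
  local filter_index=$1 human_index=$2 read1=$3 read2=$4 outdir=$5
  mkdir -p "$outdir" || return 1

  # Step 1: align to the virus/bacteria/phage index and write the reads
  # that did NOT align to filter.Unmapped.out.mate1 / filter.Unmapped.out.mate2.
  "${STAR_BIN:-STAR}" --genomeDir "$filter_index" --runThreadN 20 \
    --readFilesCommand zcat --readFilesIn "$read1" "$read2" \
    --outReadsUnmapped Fastx --outSAMtype BAM Unsorted \
    --outFileNamePrefix "$outdir/filter." || return 1

  # Step 2: align only the leftover reads to the human index.
  # The Unmapped.out.mate* files are uncompressed, so --readFilesCommand is omitted.
  "${STAR_BIN:-STAR}" --genomeDir "$human_index" --runThreadN 20 \
    --readFilesIn "$outdir/filter.Unmapped.out.mate1" "$outdir/filter.Unmapped.out.mate2" \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "$outdir/human."
}
```

The alignments in `filter.Aligned.out.bam` would then indicate which viral or bacterial sequences were hit.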

Also, it is not feasible to join the human genome and viruses/bacteria/phages genomes because:

- concatenating all the reference genomes/sequences is not always an option, and

- STAR's memory requirements explode (e.g. this "concatenated" genome would be well over 6 GB), and

- the virus/bacteria/phage genomes (everything here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ ) are updated much more frequently than the human genome, so a new "concatenated" genome would have to be built with STAR every month.

Cheers,

Dan

Alexander Dobin

Nov 5, 2015, 12:06:34 PM

to rna-star

Hi Dan,

a solution was proposed here:

https://groups.google.com/d/msg/rna-star/hJL_DUtliCY/JeVzhm3Qv1QJ
If you reduce --seedPerWindowNmax to 30 (or maybe even 10), it should greatly speed-up the filtering step.

I have not done it myself, and I am not sure if it will work in your case.

If you try it, please let us know whether it worked.
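
One way to check this before committing to a full run is to time a subsample with several `--seedPerWindowNmax` values. The sketch below is illustrative only: the function name, paths and subsample size are hypothetical, `STAR_BIN` is an illustrative override for the executable, and 50 is included as a baseline on the assumption that it is the default listed in the STAR manual for your version. `--readMapNumber` limits mapping to the first N reads.

```shell
# Sketch only: helper name, subsample size and thread count are illustrative.
# Usage: bench_seed_per_window <genome_index> <read1.fastq.gz> <read2.fastq.gz> <outdir>
bench_seed_per_window() {
  local genome_dir=$1 read1=$2 read2=$3 outdir=$4 n
  mkdir -p "$outdir" || return 1

  # 50 = assumed default (check the manual for your STAR version);
  # 30 and 10 are the reduced values suggested in this thread.
  for n in 50 30 10; do
    "${STAR_BIN:-STAR}" --genomeDir "$genome_dir" --runThreadN 20 \
      --readFilesCommand zcat --readFilesIn "$read1" "$read2" \
      --readMapNumber 1000000 --seedPerWindowNmax "$n" \
      --outSAMtype BAM Unsorted \
      --outFileNamePrefix "$outdir/spw${n}." || return 1
  done

  # Compare the speeds STAR reports in each Log.final.out
  # (these files exist only after real STAR runs).
  grep -H "Mapping speed" "$outdir"/spw*.Log.final.out 2>/dev/null || true
}
```

Comparing the "Uniquely mapped reads %" lines in the same files shows whether the lower setting costs sensitivity.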

Cheers

Alex

jordi vaquero

Mar 25, 2016, 3:35:19 AM

to rna-star

Hi Alex,

I have been dealing with this for the last few weeks, and I have the same problem. Changing seedPerWindowNmax to 10 does not speed up the execution; it runs at a maximum speed of 5 M/h.

Is there anything else I can try to fix that?

From what I read in previous answers, it seems the unmapped reads are stalling the execution. Can we deactivate some features, e.g. simply skip a read if no match is found?

Thanks

Jordi

On Thursday, November 5, 2015 at 12:06:34 UTC-5, Alexander Dobin wrote:

Alexander Dobin

Mar 25, 2016, 3:56:15 AM

to rna-star

Hi Jordi,

first let's make sure that you have the same problem as in the previous e-mails: only a small portion of your reads can be mapped to your reference.

Please describe your data (100k).

If you can send me a link to your reference and a small portion of your reads (100000), I will have a closer look.

Cheers

Alex

Jordi Vaquero

Mar 30, 2016, 5:53:17 PM

to rna-star

Hello Alex,

The data I am trying to process are Toxoplasma gondii RNA-seq data: paired-end reads, 200 bp in total, with 150k reads in each experiment.

I have tried generating the genome with different options, as I have seen you suggest in other threads, but none of them worked.

Looking into Log.progress.out, this is what I find:

....

Mar 28 08:36:56 4.1 355363314 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:39:13 4.1 355595009 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:46:20 4.1 355826812 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:47:34 4.1 355942710 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:48:57 4.1 356174450 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:53:20 4.1 356406244 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:56:29 4.1 356638014 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 08:59:31 4.1 356869752 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:00:45 4.1 356985631 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:03:43 4.1 357101480 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:05:13 4.1 357217405 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:07:34 4.1 357449180 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:14:48 4.1 357680884 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:15:50 4.1 357796809 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:17:23 4.1 358028593 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:21:47 4.1 358260295 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:24:40 4.1 358492075 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:27:43 4.1 358723845 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:29:10 4.1 358839734 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:32:02 4.1 358955638 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:33:38 4.1 359071496 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:35:53 4.1 359303292 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:43:05 4.0 359535095 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:44:13 4.1 359650954 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:45:40 4.1 359882762 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:50:13 4.1 360114572 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:52:58 4.1 360346297 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:55:55 4.1 360578111 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 09:57:36 4.1 360693974 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 10:00:27 4.1 360809926 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 10:01:47 4.1 360925787 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%
Mar 28 10:04:13 4.1 361157600 202 16.9% 200.4 0.3% 0.5% 0.0% 0.0% 82.6% 0.0%

The mapping rate ranges from 30% down to 15%, and the speed is at best 5.0 M/h. I am not sure I can send you any data, since it belongs to collaborators and may be confidential.

Right now the run is still going since last Friday, whereas normal runs finish in less than an hour.

Thanks

Jordi

Alexander Dobin

Apr 1, 2016, 4:06:35 PM

to rna-star

Hi Jordi,

do you know where the rest of the reads (70-85%) map? If they map to host species genome, then the most accurate way to increase the speed of mapping is to include the host species genome,