rna-star 2.4.2a run time drastically slow


limeri...@gmail.com

Nov 4, 2015, 10:58:08 AM
to rna-star
Hi Alex,

I have a question about STAR runs that are aligning much more slowly than normal, I think.
Runs take >4 hours on a 96 GB, 24-core machine for paired-end reads with ~150 million mate pairs. I've reinstalled the latest STAR version (2.4.2a) and regenerated the genome.
The only difference from previous, much faster runs is that I'm mapping to only one chromosome, chr5. I've attached a few lines of the progress log below to give an idea of the run speeds.

Any speed up ideas would be greatly appreciated.

Sean.

=============================================================================================

Command calls used :

STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ~/STAR_indices/hg19.chr5 --genomeFastaFiles ~/genomes/hg19.chr5.fa --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --sjdbOverhang 100

STAR --genomeDir ~/STAR_indices/hg19_chr5 --runThreadN 20 --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz --twopassMode Basic --twopass1readsN -1 --outFileNamePrefix STAR_align/read. --outSAMattributes All

#####  I had also initially attempted running the job in 2pass mode as follows but had to abandon the run after >24 hours with alignment still in the first pass mapping phase. #####
STAR --genomeDir ~/STAR_indices/hg19_chr5 --runThreadN 20 --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --sjdbGTFfile hg19.igenome_ucsc.chr5.gtf --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz --twopassMode Basic --twopass1readsN -1 --outFileNamePrefix STAR_align/read. --outSAMattributes All

==============================================================================================

Example Progress log:

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 03 15:15:12      0.2       67204      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:16:12      0.7      268871      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:37:23      0.8      604564      400     0.0%    299.0     2.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:38:36      1.1      872971      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 15:59:35      1.0     1141374      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 16:00:59      1.2     1409380      400     0.0%    300.0     2.2%     0.0%     0.0%     0.0%   100.0%     0.0%
...


limeri...@gmail.com

Nov 4, 2015, 12:57:08 PM
to rna-star
Just to update: I reran the same task using the full genome and progress is vastly improved. I still don't understand why aligning to a single chromosome is so slow.

==> hs_mfg_35do_full_genome.Log.progress.out <==
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 03 19:15:05      1.9       67132      400     0.0%    297.5     1.7%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:16:07     22.0     1141402      400     0.0%    298.9     1.4%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:17:17     29.1     2079319      400     0.0%    299.4     1.3%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:18:21     36.1     3218292      400     0.0%    299.9     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:19:22     37.3     3955169      400     0.0%    299.7     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:20:23     40.3     4960129      400     0.0%    299.6     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 03 19:21:33     40.9     5831052      400     0.0%    299.8     1.2%     0.0%     0.0%     0.0%   100.0%     0.0%

==> hs_mfg_35do.Log.progress.out <==
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 02 16:25:31 Started 1st pass mapping
Nov 02 18:37:08      0.0       67132      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:38:19      0.1      268867      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:39:31      0.2      470297      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:40:34      0.3      738567      400     0.0%     -nan    -nan%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:41:45      0.5     1073964      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 18:43:17      0.6     1275227      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%
Nov 02 20:49:01      0.3     1409375      400     0.0%    301.0     3.0%     0.0%     0.0%     0.0%   100.0%     0.0%

Alexander Dobin

Nov 4, 2015, 1:07:35 PM
to rna-star
Hi Sean,

there is some discussion on this topic in these threads:
https://groups.google.com/d/msg/rna-star/cLpf7BuDnGY/nLXTE_pHDHgJ

Briefly, if you map to a small portion of the genome, STAR will waste a lot of time trying to find poor quality alignments for the reads that originate outside that small portion.
There are some ways to speed up such alignment, but the best way is to map to the most complete reference.

Cheers
Alex
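
To put the two progress logs above into numbers, the "Speed M/hr" column can be turned into a rough time-to-completion estimate. The following is a minimal sketch, not part of STAR: the function name eta_hours is made up for illustration, it assumes the 14-column row layout shown in the logs above, and the total number of reads (in millions) has to be supplied by the user.

```shell
#!/bin/sh
# eta_hours LOGFILE TOTAL_MILLION_READS   (hypothetical helper, not a STAR tool)
# Uses the last data row of a Log.progress.out file to print the reported
# speed, the reads processed so far, and a rough estimate of hours remaining.
eta_hours() {
  awk -v total="$2" '
    # Data rows have 14 whitespace-separated fields:
    # month, day, time, speed (M/hr), read number, read length, ...
    NF == 14 && $4 ~ /^[0-9.]+$/ { speed = $4; processed = $5 / 1e6 }
    END {
      if (speed > 0)
        printf "%.1f M/hr, %.1f M reads done, ~%.1f h remaining\n",
               speed, processed, (total - processed) / speed
      else
        print "no usable progress rows found"
    }' "$1"
}

# Example call; 150 is the approximate number of read pairs (in millions)
# mentioned at the top of this thread.
if [ -f Log.progress.out ]; then
  eta_hours Log.progress.out 150
fi
```

Applied to the last row of the chr5-only log (1.2 M/hr after 1409380 reads) this gives roughly 124 hours remaining, and to the last row of the full-genome log (40.9 M/hr after 5831052 reads) roughly 3.5 hours. The estimate is only as good as the assumption that the reported speed stays constant.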

limeri...@gmail.com

Nov 4, 2015, 1:13:32 PM
to rna-star
Got it & thanks for the discussion thread links.

Dan

Nov 5, 2015, 5:20:38 AM
to rna-star
Hi!

It would be great if STAR had some kind of command line option (something like "do not try too hard to align the reads, because it is known beforehand that less than 10% will map") for this kind of situation, where 90% of the reads do not come from the reference genome they are aligned to!

For example, this is useful when using STAR to filter reads from a human sample, as follows:
1) Filtering step => map all the reads against all known virus/bacteria/phage genomes (everything here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ ) <= here probably ~3% of reads map (and this is known a priori)
2) Actual mapping on the human genome => map all the reads that did not map in step 1 to the human genome

The filtering step has the advantage that one can also see the infection status of a patient (e.g. patient infected with HIV, hepatitis virus, etc.). This kind of approach does not work very well with STAR because of the slowdown.

Also, it is not feasible to join the human genome and the virus/bacteria/phage genomes because:
- concatenating all the reference genomes/sequences is not always a solution, and
- STAR's memory requirements explode (e.g. this "concatenated" genome would be well over 6 GB), and
- the virus/bacteria/phage genomes (everything here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ ) are updated much more frequently than the human genome, so a new "concatenated" genome would have to be built with STAR every month.

Cheers,
Dan
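
The two-step workflow Dan describes (filter first, then map the leftovers to the human genome) could be wired together roughly as follows. This is a sketch under assumptions, not a tested pipeline: the index directories and output folders are hypothetical, and it relies on STAR's --outReadsUnmapped Fastx option, which writes unmapped reads to Unmapped.out.mate1 and Unmapped.out.mate2 under the output prefix. The script is a dry run by default and only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch of a two-step "filter, then map" workflow with STAR.
# Dry run by default: commands are printed; set RUN=1 to execute them.
set -eu

FILTER_INDEX="STAR_indices/viral"   # hypothetical: index built from the filter genomes
HUMAN_INDEX="STAR_indices/hg19"     # hypothetical: full human genome index
THREADS=20

# Print the command, or execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

two_step() {
  # Output directories for the two passes (STAR writes into existing prefixes).
  run mkdir -p filter human

  # Step 1: map all reads to the filter index and keep the unmapped mates.
  run STAR --genomeDir "$FILTER_INDEX" --runThreadN "$THREADS" \
    --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz \
    --outReadsUnmapped Fastx --outSAMtype BAM Unsorted \
    --outFileNamePrefix filter/

  # Step 2: map only the reads left unmapped by step 1 to the human genome.
  # The unmapped-read files are plain text, so no --readFilesCommand is needed.
  run STAR --genomeDir "$HUMAN_INDEX" --runThreadN "$THREADS" \
    --readFilesIn filter/Unmapped.out.mate1 filter/Unmapped.out.mate2 \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix human/
}

two_step
```

Note that this only chains the two runs; it does nothing about the slowdown in step 1 that this thread is about.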

Alexander Dobin

Nov 5, 2015, 12:06:34 PM
to rna-star
Hi Dan,

a solution was proposed here:
https://groups.google.com/d/msg/rna-star/hJL_DUtliCY/JeVzhm3Qv1QJ
If you reduce --seedPerWindowNmax to 30 (or maybe even 10), it should greatly speed up the filtering step.
I have not done it myself, and I am not sure if it will work in your case.
If you try it, please let us know whether it worked.

Cheers
Alex
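
Since the effect of --seedPerWindowNmax is untested here, one way to check it cheaply is to align only the first part of the input with each candidate value and compare the "Speed M/hr" column in the resulting Log.progress.out files. The sketch below assumes a hypothetical index path and output folder names, and uses STAR's --readMapNumber option to limit the number of reads mapped. It is a dry run by default and only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch: short test alignments with different --seedPerWindowNmax values.
# Dry run by default: commands are printed; set RUN=1 to execute them.
set -eu

GENOME_DIR="STAR_indices/filter"   # hypothetical index path
N_READS=1000000                    # map only the first million reads per test

# Print the command, or execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# bench VALUE...: one test alignment per --seedPerWindowNmax value,
# each written to its own output folder (bench_seed<VALUE>/).
bench() {
  for n in "$@"; do
    run mkdir -p "bench_seed$n"
    run STAR --genomeDir "$GENOME_DIR" --runThreadN 20 \
      --readFilesCommand zcat --readFilesIn read1.fastq.gz read2.fastq.gz \
      --readMapNumber "$N_READS" \
      --seedPerWindowNmax "$n" \
      --outSAMtype BAM Unsorted \
      --outFileNamePrefix "bench_seed$n/"
  done
}

# The two values suggested above.
bench 30 10
```

Comparing the mapped percentages between the test runs as well as the speeds shows whether a lower value costs sensitivity on your data.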

jordi vaquero

Mar 25, 2016, 3:35:19 AM
to rna-star
Hi Alex, 
I have been dealing with this over the last few weeks, and I have the same problem. Changing --seedPerWindowNmax to 10 does not speed up the run; it tops out at about 5 M/hr.
Is there anything else I can try to fix this?
From what I read in previous answers, it seems the unmapped reads are stalling the run. Can we deactivate some features, e.g. simply skip a read if no match is found?

Thanks

Jordi


On Thursday, November 5, 2015 at 12:06:34 UTC-5, Alexander Dobin wrote:

Alexander Dobin

Mar 25, 2016, 3:56:15 AM
to rna-star
Hi Jordi,

first let's make sure that you have the same problem as in the previous e-mails: only a small portion of your reads can be mapped to your reference.
Please describe your data.

If you can send me a link to your reference and a small portion of your reads (100000), I will have a closer look.

Cheers
Alex

Jordi Vaquero

Mar 30, 2016, 5:53:17 PM
to rna-star
Hello Alex, 
the data I am trying to process are Toxoplasma gondii RNA-seq data: paired-end reads, 200 bp total, and a total of 150k reads in each experiment.
I have tried generating the genome with different options, as I have seen in your answers to other threads, but none of them worked.
Looking into Log.progress.out, this is what I find:
           ....

Mar 28 08:36:56      4.1   355363314      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:39:13      4.1   355595009      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:46:20      4.1   355826812      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:47:34      4.1   355942710      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:48:57      4.1   356174450      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:53:20      4.1   356406244      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:56:29      4.1   356638014      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 08:59:31      4.1   356869752      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:00:45      4.1   356985631      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:03:43      4.1   357101480      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:05:13      4.1   357217405      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:07:34      4.1   357449180      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:14:48      4.1   357680884      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:15:50      4.1   357796809      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:17:23      4.1   358028593      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:21:47      4.1   358260295      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:24:40      4.1   358492075      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:27:43      4.1   358723845      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:29:10      4.1   358839734      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:32:02      4.1   358955638      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:33:38      4.1   359071496      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:35:53      4.1   359303292      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:43:05      4.0   359535095      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:44:13      4.1   359650954      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:45:40      4.1   359882762      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:50:13      4.1   360114572      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:52:58      4.1   360346297      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:55:55      4.1   360578111      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 09:57:36      4.1   360693974      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 10:00:27      4.1   360809926      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 10:01:47      4.1   360925787      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%
Mar 28 10:04:13      4.1   361157600      202    16.9%    200.4     0.3%     0.5%     0.0%     0.0%    82.6%     0.0%


The mapping rate varies between 15% and 30%, and the speed is at best 5.0 M/hr. I am not sure I can send you any data, since it belongs to collaborators and may be confidential.
Right now the run has been going since last Friday, whereas normal runs finish in less than an hour.

Thanks

Jordi

Alexander Dobin

Apr 1, 2016, 4:06:35 PM
to rna-star
Hi Jordi,

do you know where the rest of the reads (70-85%) map? If they map to the host species genome, then the most accurate way to increase the mapping speed is to include the host species genome in the reference.

Cheers
Alex