STAR is very slow

ERA

Feb 7, 2017, 10:10:24 AM

to rna-...@googlegroups.com

Hi Alex,

Do you know what could make STAR abnormally slow? After almost two days, STAR is still aligning a single paired-end sample (30 million reads per sample, 100 bp) to a 17 Gb reference genome. I used --runThreadN 24 and --limitBAMsortRAM 16000000000. These are the output files so far:

-rw-r--r-- 1 0 Feb 5 18:19 Aligned.sortedByCoord.out.bam

-rw-r--r-- 1 4904174 Feb 5 19:06 Aligned.toTranscriptome.out.bam

-rwxr-xr-- 1 255152721 Feb 6 07:59 Log.out

-rw-r--r-- 1 5310 Feb 7 07:31 Log.progress.out

Any suggestions, please?

Thanks,

ERA

Alexander Dobin

Feb 8, 2017, 5:07:49 PM

to rna-star

Hi @ERA,

please send me the Log.progress.out and Log.out files. Does your genome have a lot of contigs/scaffolds, in addition to the large size? How much RAM do you have?

Cheers

Alex

ERA

Feb 9, 2017, 10:27:49 AM

to rna-...@googlegroups.com

Hi Alex,

It’s still running. Attached are the Log.progress.out and Log.out files. I have 200 GB of RAM. The number of contigs in the reference:

grep -c "^>" refGenome.fa

7188907

Thanks,

ERA

Log.progress.zip

Alexander Dobin

Feb 9, 2017, 11:55:20 AM

to rna-star

Hi @ERA,

the most likely cause of the slow-down is the large number of contigs in your genome. This happens typically when this number exceeds 50-100k, and you have ~7M.

Your mappability is very good even though ~50% of the reads are multi-mappers.

To increase the speed, you could concatenate most of your short contigs into one "super-contig" with some N-padding in between them.

For instance, you could sort your contigs by length, keep the longest 50k contigs separate, and concatenate the remaining 7M.

After mapping, you would need to convert the coordinates back from the super-contig into separate contigs.

If all of the short ones are shorter than 1kb, you can pad them with Ns so that they all are 1kb long - this will simplify the conversion.
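As a rough illustration of the padding idea above, here is a minimal Python sketch. It assumes the contigs are already loaded into a dict mapping name to sequence; the function names (split_by_length, build_super_contig, super_to_contig) and the SLOT constant are made up for this example and are not part of STAR. With every short contig padded to the same 1 kb slot, converting back is a single integer division.

```python
SLOT = 1000  # slot width in bases: every short contig is padded to 1 kb


def split_by_length(contigs, n_long):
    """Sort contigs by length; return (longest n_long, the rest) as (name, seq) lists."""
    ordered = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ordered[:n_long], ordered[n_long:]


def build_super_contig(short, slot=SLOT):
    """Pad each short contig with Ns up to `slot` bases and concatenate them."""
    names, parts = [], []
    for name, seq in short:
        if len(seq) >= slot:
            raise ValueError(f"{name} is not shorter than the slot width {slot}")
        names.append(name)
        parts.append(seq + "N" * (slot - len(seq)))
    return names, "".join(parts)


def super_to_contig(pos, names, slot=SLOT):
    """Map a 1-based super-contig position to (contig name, 1-based position).

    A position that falls in the N padding comes back with a position larger
    than the contig length; the caller has to check that separately.
    """
    index, offset = divmod(pos - 1, slot)
    return names[index], offset + 1


# Toy data: one "long" contig and two short ones.
contigs = {"c1": "ACGT" * 500, "c2": "ACGTAC", "c3": "GGG"}
long_contigs, short_contigs = split_by_length(contigs, n_long=1)
names, super_seq = build_super_contig(short_contigs)
print(names, len(super_seq))          # ['c2', 'c3'] 2000
print(super_to_contig(1001, names))   # ('c3', 1)
```

For the real genome, n_long would be something like 50000, and the contigs would be read from the FASTA file rather than typed in.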

Cheers

Alex

ERA

Feb 14, 2017, 11:24:22 AM

to rna-star

Hi Alex,

Is it possible to index the genome without the annotation.gtf file? I lost the chromosome names corresponding to each sequence in the super-contig reference genome, so I cannot index it using the annotation.gtf file.

Error message:

…

Feb 14 06:57:59 ... finished generating suffix array

Feb 14 06:57:59 ... generating Suffix Array index

Feb 14 07:04:24 ... completed Suffix Array index

Feb 14 07:04:24 ..... processing annotations GTF

Fatal INPUT FILE error, no valid exon lines in the GTF file: /work/annotation.gtf

Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

The super-contig is composed of all the short sequences (<1 kb), separated from each other by 30 Ns.

Cheers,

ERA

Alexander Dobin

Feb 14, 2017, 1:00:50 PM

to rna-star

Hi @ERA,

sure, you can index the genome without GTF (just omit this option), however, you will lose sensitivity for splices with short overhangs.

In this case, I would strongly recommend using the 2-pass mapping.

Another simple option would be to filter out from the GTF file those annotations that reside on scaffolds that you combined into the super-contig.

The best option, of course, is to transform those coordinates into the super-contig coordinates.
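The simple filtering option could look roughly like this in Python. This is only a sketch: filter_gtf_lines and the short_names set are hypothetical names, and it assumes a tab-separated GTF whose first column is the contig name as it appears in the FASTA file.

```python
def filter_gtf_lines(lines, short_names):
    """Yield the GTF lines that do not sit on a contig merged into the super-contig."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        seqname = line.split("\t", 1)[0]  # column 1 is the contig name
        if seqname not in short_names:
            yield line


# Toy example: "c2" was merged into the super-contig, "c1" was kept as is.
gtf = [
    '#!genome-build toy\n',
    'c1\tsrc\texon\t10\t90\t.\t+\t.\tgene_id "g1";\n',
    'c2\tsrc\texon\t1\t5\t.\t+\t.\tgene_id "g2";\n',
]
kept = list(filter_gtf_lines(gtf, short_names={"c2"}))
print(len(kept))  # 2
```

On a real file you would pass an open file handle as lines and write the kept lines to the modified GTF.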

Cheers

Alex

ERA

Feb 15, 2017, 4:26:51 PM

to rna-star

Hi Alex,

I’d like to use the second option, but I am not sure I understand the order of the steps. Do you mean that I have to: 1- concatenate the short contigs into one super-contig; 2- filter out, from the GTF file, the annotations corresponding to the contigs that make up the super-contig; 3- transform the coordinates of the filtered annotations into super-contig coordinates (I do not know the coordinates of the contigs that make up the super-contig); 4- index the super-contig; and 5- map the samples against the STAR-indexed super-contig? Sorry if I sound like a novice; I am not yet very familiar with high-throughput sequencing data. I’d really appreciate it if you could also suggest an R package (or a simple script) for this coordinate conversion.

Thanks,

Era

Alexander Dobin

Feb 16, 2017, 4:58:51 PM

to rna-star

Hi @ERA,

I would suggest the following steps:

1. Sort your contigs into two sets - long and short. For instance, the 50,000 longest contigs are long, and the rest are short.

Concatenate the short contigs into one long super-contig sequence with a single name, but keep all the long contigs as is.

It's probably best to keep all the long contigs in one file (say Long.fa), and the short ones in a separate file (Short.fa).

While concatenating the short contigs, it might be helpful to pad each of the short contigs with Ns so they all have the same length.

Say, if all short contigs are shorter than 1000b, you can pad them all to 1000b. This will make it easier to transform the alignments from the super-contig to separate contigs.

For each of the short contigs, you may want to record its start coordinate on the super-contig.

2. From the full GTF file, take only the lines which reside on long contigs, i.e. filter *out* the lines that reside on short contigs.

2-alt. Alternatively, instead of simply removing annotations on short contigs, you can actually transform the coordinates of annotations (exons) from the contig to supercontig coordinates.

Basically, for each exon start/end you need to add the start position of the contig on the supercontig, which you have recorded in step 1.

Whichever path you choose, you will have a modified annotation file, AnnotModified.gtf. Column 1 of this file should only contain long contig and super-contig names, but not short contig names.

3. Generate the genome for the combined long and short contigs:

STAR ... --genomeFastaFiles Long.fa Short.fa --sjdbGTFfile AnnotModified.gtf

4. After the alignments are done, you may want to convert the super-contig alignments into the separate contig coordinates.
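Steps 2-alt and 4 are the same shift applied in opposite directions, so they can be sketched together in Python. This is an illustration under assumptions, not a tested pipeline: gtf_to_super, make_locator and the super-contig name "SuperContig" are hypothetical, and starts is assumed to be a dict from short contig name to its 1-based start coordinate on the super-contig, as recorded in step 1.

```python
import bisect


def gtf_to_super(line, starts, super_name="SuperContig"):
    """Step 2-alt: move one GTF line from a short contig onto the super-contig."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 9 or fields[0] not in starts:
        return line  # comments and long-contig annotations pass through unchanged
    shift = starts[fields[0]] - 1  # starts are 1-based, like GTF coordinates
    fields[0] = super_name
    fields[3] = str(int(fields[3]) + shift)  # column 4: feature start
    fields[4] = str(int(fields[4]) + shift)  # column 5: feature end
    ending = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + ending


def make_locator(starts):
    """Step 4: build a function mapping a 1-based super-contig position
    back to (contig name, 1-based position on that contig)."""
    ordered = sorted((start, name) for name, start in starts.items())
    keys = [start for start, _ in ordered]

    def locate(pos):
        i = bisect.bisect_right(keys, pos) - 1  # last contig starting at or before pos
        if i < 0:
            raise ValueError(f"position {pos} is before the first contig")
        start, name = ordered[i]
        # Positions inside the N padding are not detected here; that would
        # also need the contig lengths.
        return name, pos - start + 1

    return locate


# Toy example: two short contigs placed at 1 and 1001 on the super-contig.
starts = {"c2": 1, "c3": 1001}
line = 'c3\tsrc\texon\t5\t20\t.\t+\t.\tgene_id "g1";\n'
print(gtf_to_super(line, starts), end="")  # exon now at 1005-1020 on SuperContig
locate = make_locator(starts)
print(locate(1005))  # ('c3', 5)
```

The binary search works for any padding scheme, including a fixed 30-N spacer, as long as the start coordinates were recorded while building the super-contig.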

Hope this helps - please let me know if you have any questions

Cheers

Alex
