Generate genome index with GTF annotation --> slow mapping 2M/hr

Michael

Mar 7, 2014, 7:51:52 AM

to rna-...@googlegroups.com

Hi,

I generated a genome index with STAR version 2.3.1z using the following parameters:

STAR --runMode genomeGenerate --genomeDir $outFolder --genomeFastaFiles $path/mm10_allExtra.fa --runThreadN 60 --sjdbOverhang 99 --sjdbGTFfile $path/mm10/annotation/mm10_all_mRNA.gtf

Is it normal that the --sjdbGTFfile parameter does not show up in the "genomeParameters.txt" file?

versionGenome    20201
genomeFastaFiles   $path/mm10_allExtra.fa
genomeSAindexNbases    14
genomeChrBinNbits    18
genomeSAsparseD    1
sjdbOverhang    99
sjdbFileChrStartEnd    -

When I use this genome index for mapping with 16 cores, it is very slow (between 0.9 and 4.5 M/hr). With an index I generated without an annotation file, I get speeds between 160 and 200 M/hr.

Any idea why mapping with the annotation index is so slow?

Regards,
Michael

Alexander Dobin

Mar 7, 2014, 5:29:55 PM

to rna-...@googlegroups.com

Hi Michael,

Thanks for pointing out that --sjdbGTFfile is missing from the genomeParameters.txt file. At the moment STAR does not really need it at the mapping stage, so it would be there only for informational purposes, but I will fix it.

The mapping speed reduction when using sjdb should be negligible, so something is definitely wrong.

How do the Log.final.out files compare between the runs with and without annotations?

Is the mm10_all_mRNA.gtf file public? If so, could you send it to me?

Cheers

Alex

Michael

Mar 10, 2014, 7:18:30 AM

to rna-...@googlegroups.com

On Friday, March 7, 2014 at 23:29:55 UTC+1, Alexander Dobin wrote:

> Hi Michael,
>
> Thanks for pointing out that --sjdbGTFfile is missing from the genomeParameters.txt file. At the moment STAR does not really need it at the mapping stage, so it would be there only for informational purposes, but I will fix it.

Ok.

> The mapping speed reduction when using sjdb should be negligible, so something is definitely wrong.
> How do the Log.final.out files compare between the runs with and without annotations?

I stopped the run with the annotation because it was so slow. I have started it again, but I'm not sure when it will finish. Progress so far on 16 cores:

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Mar 10 12:04:20      3.8      117299      200    63.3%    196.3     0.6%    12.8%     1.0%     0.0%    21.7%     1.3%
Mar 10 12:08:35      2.3      234542      200    63.4%    196.3     0.6%    12.8%     1.0%     0.0%    21.5%     1.3%

> Is the mm10_all_mRNA.gtf file public? If so, could you send it to me?

Yes, it is public. I downloaded it from the UCSC Table Browser with these settings:
genome: mouse
assembly: mm10
group: mRNA and EST
table: all_mrna
region: genome
output format: gtf

I uploaded the file here

Greetings,
Michael

Michael

Mar 12, 2014, 9:50:29 AM

to rna-...@googlegroups.com

I attached the log files you asked for.

Log.final.out

Alexander Dobin

Mar 13, 2014, 6:07:45 PM

to rna-...@googlegroups.com

Hi Michael,

I generated the genome with the mRNA.gtf file and confirmed your observation that mapping is extremely slow.

I think the explanation is as follows. A large fraction of the splice junctions that STAR extracts from this file (~133k out of ~412k) have intron lengths <= 0, i.e. consecutive exons of the same transcript overlap. STAR expects that exons belonging to the same transcript in the GTF file do not overlap. After I removed the short introns (about 20 bases or shorter) with

$ awk '$3-$2>20 {print}' sjdbList.out.tab > sjdbList.out.tab.minIntron20

(which filters out ~150k junctions), and re-generated the genome with

--sjdbFileChrStartEnd sjdbList.out.tab.minIntron20

the mapping speed returned to normal.
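
Before filtering, it can help to see how many junctions fall into each class. The following is a minimal sketch, assuming sjdbList.out.tab is tab-separated with chromosome, intron start, and intron end in the first three columns (the same assumption the awk filter above makes). The function name sj_length_summary and its optional threshold argument are made up for illustration.

```shell
# Summarize junctions by the same end-minus-start difference used in the filter.
# Usage: sj_length_summary FILE [THRESHOLD]   (THRESHOLD defaults to 20)
sj_length_summary() {
    awk -F '\t' -v min="${2:-20}" '
        { d = $3 - $2; total++ }
        d <= 0   { nonpos++ }   # non-positive difference: exons overlap or abut
        d <= min { short++ }    # would be dropped by the $3-$2>min filter
        END {
            printf "total\t%d\n", total
            printf "diff<=0\t%d\n", nonpos
            printf "diff<=%d\t%d\n", min, short
            printf "kept\t%d\n", total - short
        }' "$1"
}
```

It could be run as `sj_length_summary sjdbList.out.tab 20`; a large "diff<=0" count would point to overlapping exons in the source GTF.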

I think it's a good idea to use the junctions from the mRNAs as annotated junctions. I would also recommend adding the standard annotated junctions (say, from a GENCODE GTF); you can use --sjdbFileChrStartEnd and --sjdbGTFfile simultaneously.
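
Combining the two junction sources at index-generation time might look like the sketch below. The flags are the ones already used in this thread, and --sjdbOverhang 99 is copied from the original command. All paths are placeholders, and the helper name star_index_cmd is hypothetical. The function only prints the command (a dry run) so it can be checked before being run for real.

```shell
# Print a genomeGenerate command that uses both junction sources.
# All four arguments are placeholder paths chosen for illustration.
star_index_cmd() {
    genome_dir=$1; fasta=$2; gtf=$3; sj_tab=$4
    # Each argument is printed on its own line, followed by " \".
    printf '%s \\\n' \
        "STAR --runMode genomeGenerate" \
        "  --genomeDir $genome_dir" \
        "  --genomeFastaFiles $fasta" \
        "  --sjdbGTFfile $gtf" \
        "  --sjdbFileChrStartEnd $sj_tab"
    printf '  --sjdbOverhang 99\n'
}

star_index_cmd star_index genome.fa gencode.gtf sjdbList.out.tab.minIntron20
```

Printing first makes it easy to confirm that both --sjdbGTFfile and --sjdbFileChrStartEnd are present before committing to a long index build.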

Cheers

Alex

Oscar Harari

Apr 15, 2014, 10:36:19 AM

to rna-...@googlegroups.com

Hi Alex,

I am using STAR to analyze RNA-seq data that includes 2 biological replicates. My approach is to first run STAR and Cufflinks on each experiment and then pick the transcripts that were predicted in both to augment the original annotation. Then I re-run STAR with the extended annotation.

In the first round, STAR behaved as usual, and for my configuration I got ~33 M/hr.

I just don't understand why its speed has degraded to 12 M/hr in the second round.

I checked, and I have very few short introns (<20b).

Any suggestions?

Many thanks,

Oscar

Alexander Dobin

Apr 16, 2014, 4:24:39 PM

to rna-...@googlegroups.com

Hi Oscar,

Can you compare the sjdbList.out.tab files generated with the standard and the "extended" (Cufflinks) annotations? How many additional junctions are you adding, and how many of those are non-canonical?

Are there any junctions in the mitochondrial genome? Those are likely to be false and have been reported to cause a slowdown.

If you send me the sjdbList.out.tab files and Log.final.out files from the two runs, I can have a closer look.
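
The comparison asked for above can be sketched with standard tools. This assumes both sjdbList.out.tab files are tab-separated with chromosome, start, and end in the first three columns, and that the mitochondrial sequence is named chrM; both are assumptions to adjust for your genome. The function name sj_added is hypothetical. Counting non-canonical junctions would additionally need the genome sequence and is not attempted here.

```shell
# Count junctions present in the extended list but not in the standard one,
# and how many of those lie on the mitochondrial sequence.
# Usage: sj_added STANDARD EXTENDED [MITO_NAME]   (MITO_NAME defaults to chrM)
# Note: the NR == FNR idiom assumes the STANDARD file is not empty.
sj_added() {
    awk -F '\t' -v mito="${3:-chrM}" '
        { key = $1 FS $2 FS $3 }
        NR == FNR { seen[key] = 1; next }   # first file: standard junctions
        !(key in seen) {                    # second file: new junctions only
            added++
            if ($1 == mito) added_mito++
            seen[key] = 1                   # count duplicates only once
        }
        END {
            printf "added\t%d\n", added
            printf "added_on_%s\t%d\n", mito, added_mito
        }' "$1" "$2"
}
```

For example, `sj_added standard/sjdbList.out.tab extended/sjdbList.out.tab` (placeholder paths) prints the number of added junctions and how many of them are on chrM.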