Generate genome index with GTF annotation --> slow mapping 2M/hr

Michael

Mar 7, 2014, 7:51:52 AM
to rna-...@googlegroups.com
Hi,

I generated a genome index with STAR version 2.3.1z using the following parameters:

STAR --runMode genomeGenerate --genomeDir $outFolder --genomeFastaFiles $path/mm10_allExtra.fa --runThreadN 60 --sjdbOverhang 99 --sjdbGTFfile $path/mm10/annotation/mm10_all_mRNA.gtf

Is it normal that the --sjdbGTFfile parameter does not show up in the "genomeParameters.txt" file?

versionGenome    20201
genomeFastaFiles   $path/mm10_allExtra.fa
genomeSAindexNbases    14
genomeChrBinNbits    18
genomeSAsparseD    1
sjdbOverhang    99
sjdbFileChrStartEnd    -

When I use this genome index for mapping with 16 cores, it is very slow (between 0.9 and 4.5 M/hr). With an index that I generated without an annotation file, I got speeds between 160 and 200 M/hr.

Any idea why mapping with the annotation index is so slow?

Regards,
Michael

Alexander Dobin

Mar 7, 2014, 5:29:55 PM
to rna-...@googlegroups.com
Hi Michael,

Thanks for pointing out that --sjdbGTFfile is missing from the genomeParameters.txt file. At the moment STAR does not really need it at the mapping stage, so it would be there only for informational purposes, but I will fix it.
The mapping speed reduction when using sjdb should be negligible, so something is definitely wrong.
How do the Log.final.out files compare between the runs with and without annotations?
Is the mm10_all_mRNA.gtf file public? If so, could you send it to me?

Cheers
Alex

Michael

Mar 10, 2014, 7:18:30 AM
to rna-...@googlegroups.com


On Friday, March 7, 2014 at 23:29:55 UTC+1, Alexander Dobin wrote:
Hi Michael,

Thanks for pointing out that --sjdbGTFfile is missing from the genomeParameters.txt file. At the moment STAR does not really need it at the mapping stage, so it would be there only for informational purposes, but I will fix it.

Ok.
 
The mapping speed reduction when using sjdb should be negligible, so something is definitely wrong.
How do the Log.final.out files compare between the runs with and without annotations?

I stopped the run with the annotation because it was so slow. I have started it again, but I'm not sure when it will finish. Progress so far on 16 cores:

Time             Speed  Read    Read    Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped  Unmapped
                 M/hr   number  length  unique  length  MMrate  multi   multi+    MM        short     other
Mar 10 12:04:20  3.8    117299  200     63.3%   196.3   0.6%    12.8%   1.0%      0.0%      21.7%     1.3%
Mar 10 12:08:35  2.3    234542  200     63.4%   196.3   0.6%    12.8%   1.0%      0.0%      21.5%     1.3%
 
Is the mm10_all_mRNA.gtf file public? If so, could you send it to me?

Yes, it is public. I downloaded it from the UCSC Table Browser with these settings:
genome: mouse
assembly: mm10
group: mRNA and EST
table: all_mrna
region: genome
output format: gtf

I uploaded the file here

Greetings,
Michael

Michael

Mar 12, 2014, 9:50:29 AM
to rna-...@googlegroups.com
I attached the log files you asked for.
Log.final.out
Log.final.out

Alexander Dobin

Mar 13, 2014, 6:07:45 PM
to rna-...@googlegroups.com
Hi Michael,

I generated the genome with the mRNA GTF file and confirmed your observation that mapping is extremely slow.
I think the explanation is as follows. A large fraction of the splice junctions that STAR extracts from this file (~133k out of ~412k) have intron lengths <=0, i.e. consecutive exons of the same transcript overlap. STAR expects exons that belong to the same transcript in the GTF file not to overlap. After I removed the short introns (<20b) with
$ awk '$3-$2>20 {print}' sjdbList.out.tab > sjdbList.out.tab.minIntron20
(which filters out ~150k junctions) and re-generated the genome with
 --sjdbFileChrStartEnd sjdbList.out.tab.minIntron20
the mapping speed returned to normal.
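
For readers who want to repeat the filtering step on their own junction list, here is a minimal sketch that wraps the awk one-liner above in two shell functions. The function names `filter_short_introns` and `count_short_introns` are hypothetical, and the sketch assumes that column 2 and column 3 of sjdbList.out.tab hold the intron start and end, as the one-liner above implies. It mirrors the `$3-$2` test from the message rather than computing an exact intron length.

```shell
# Sketch only: function names are hypothetical, not part of STAR.
# Assumes whitespace-separated columns: chr, intron start, intron end, strand.

# Keep junctions where (end - start) is greater than the threshold.
# $1: threshold in bases (default 20). Reads stdin, writes stdout.
filter_short_introns() {
    awk -v min="${1:-20}" '$3 - $2 > min'
}

# Count the junctions that the filter above would drop.
# $1: threshold in bases (default 20). Reads stdin.
count_short_introns() {
    awk -v min="${1:-20}" '$3 - $2 <= min { n++ } END { print n + 0 }'
}

# Example usage (file names taken from the message above):
#   count_short_introns 20 < sjdbList.out.tab
#   filter_short_introns 20 < sjdbList.out.tab > sjdbList.out.tab.minIntron20
```

Counting before filtering gives a quick check of whether short or negative-length introns are numerous enough to explain a slowdown.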

I think it's a good idea to use the junctions from the mRNA as annotated junctions. I would also recommend adding the standard annotated junctions (say, from a Gencode GTF); you can use --sjdbFileChrStartEnd and --sjdbGTFfile simultaneously.
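
As a rough illustration of using both options together, the sketch below only assembles and prints a genome-generation command so it can be inspected before running. All flags are the ones that appear in this thread; the helper name `star_index_cmd` and every path in the example call (including `gencode.gtf`) are hypothetical placeholders, and paths containing spaces are not handled.

```shell
# Sketch only: prints a STAR command line, it does not run STAR.
# $1: genome directory, $2: genome FASTA,
# $3: annotation GTF, $4: filtered junction list
star_index_cmd() {
    printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbOverhang 99 --sjdbGTFfile %s --sjdbFileChrStartEnd %s\n' \
        "$1" "$2" "$3" "$4"
}

# Dry run with placeholder paths (all hypothetical):
star_index_cmd ./star_index mm10_allExtra.fa gencode.gtf sjdbList.out.tab.minIntron20
```

The `--sjdbOverhang 99` value is copied from the original post and would need to match the read length of the data being mapped.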

Cheers
Alex

Oscar Harari

Apr 15, 2014, 10:36:19 AM
to rna-...@googlegroups.com
Hi Alex, 

I am using STAR to analyze RNA-seq data that includes 2 biological replicates. My approach is to first run STAR and Cufflinks on each experiment and then pick the transcripts that were predicted in both to augment the original annotation. Then I re-run STAR with the extended annotation.
In the first round, STAR behaves as usual, and with my configuration I get ~33 M/hr.
I just don't understand why its speed degrades to 12 M/hr in the second round.
I checked, and I have very few short introns (<20b).
Any suggestions?

Many thanks, 
Oscar

Alexander Dobin

Apr 16, 2014, 4:24:39 PM
to rna-...@googlegroups.com
Hi Oscar,

Can you compare the sjdbList.out.tab files from the standard and "extended" (Cufflinks) annotations? How many additional junctions are you adding, and how many of those are non-canonical?
Are there any junctions in the mitochondrial genome? Those are likely to be false and have been reported to cause a slowdown.
If you send me the sjdbList.out.tab files and Log.final.out files from the two runs, I can have a closer look.
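
Two of the checks suggested above can be sketched with plain awk. The function names are hypothetical, and the sketch assumes that both junction lists use the same chr, start, end column layout and that the mitochondrial chromosome is named `chrM` (adjust if your genome uses `MT` or another name). Counting non-canonical junctions would need the genome sequence and is not covered here.

```shell
# Sketch only: function names are hypothetical, not part of STAR.

# Print junctions that are in the extended list but not in the standard one.
# $1: standard sjdbList.out.tab, $2: extended sjdbList.out.tab
added_junctions() {
    awk 'FILENAME == ARGV[1] { seen[$1, $2, $3] = 1; next }
         !(($1, $2, $3) in seen)' "$1" "$2"
}

# Count junctions on the mitochondrial chromosome.
# $1: junction file, $2: chromosome name (assumed default: chrM)
count_mito_junctions() {
    awk -v mt="${2:-chrM}" '$1 == mt { n++ } END { print n + 0 }' "$1"
}

# Example usage (file names are placeholders):
#   added_junctions standard/sjdbList.out.tab extended/sjdbList.out.tab | wc -l
#   count_mito_junctions extended/sjdbList.out.tab chrM
```

Junctions are matched on chromosome, start and end only, so a junction that differs only in strand is treated as already present.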

Cheers
Alex