mouse genome index with gtf from gencode


Janet

Apr 17, 2017, 10:56:31 AM
to rna-star
Hi there,

I tried to generate a genome index for mouse using the GTF from GENCODE, and then mapped my reads to the indexed mouse genome. My commands are as follows:

STAR --runThreadN 32 --runMode genomeGenerate --genomeDir mouse_v87/STAR_db --genomeFastaFiles mouse_v87/Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile gencode.vM13.primary_assembly.annotation.gtf --sjdbOverhang 100  --outFileNamePrefix mouse_v87/mouse_

STAR --runThreadN 32 --genomeDir mouse_v87/STAR_db --readFilesIn my_reads.fastq --outSAMprimaryFlag AllBestScore --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix my_reads_mouse.star

However, from one of the output files "my_reads_mouse.starReadsPerGene.out.tab" (attached here), only 86 genes are listed. I checked the indexed mouse DB folder, and found that the file "geneInfo.tab" only contains 86 genes. In fact, the gencode gtf file has 50686 genes. 


To test my commands, I used the Ensembl GTF file instead. This time, the "geneInfo.tab" file contains all 49671 genes. Because STAR recommends the GTF from GENCODE for the human and mouse genomes, I'm wondering how to properly index the genome with the GENCODE GTF. Thank you!

Best,
Janet
my_reads_mouse.starReadsPerGene.out.tab

Alexander Dobin

Apr 17, 2017, 4:24:31 PM
to rna-star
Hi Janet,

this should not happen - please send me the Log.out file of this run.

Cheers
Alex

Janet

Apr 17, 2017, 4:50:26 PM
to rna-star
Hi Alex,

I have already overwritten the log file. I will have to regenerate it and send it to you later. Thanks!

-Janet

Dario Strbenac

Apr 18, 2017, 4:00:07 AM
to rna-star
You must ensure that the genome sequence you're mapping to and the gene annotation have the same chromosome naming format.

  • The mouse genome you used contains chromosome names like 1, 2, 3 (e.g. >1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF).
  • GENCODE Genes uses chromosome names like chr1, chr2, chr3 (e.g. chr1    ENSEMBL    gene    3102016    3102125    .    +    .).
  • The ENSEMBL gene database uses chromosome names like 1, 2, 3 (e.g. 1    ensembl    gene    3102016    3102125    .    +    .).

If you ensure that both the GENCODE Genes and genome sequence files use the same chromosome name format, you'll be able to successfully do your analysis with STAR.
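A quick way to confirm a mismatch like this before building the index is to compare the sequence names in the FASTA headers with the first column of the GTF. The following is only a minimal sketch: the function names (`fasta_names`, `gtf_names`, `compare_names`) are made up for illustration and are not part of STAR.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect sequence names from the first (tab-separated) GTF column."""
    names = set()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Report which sequence names the two files share and which they do not."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_names(gtf_lines)
    return {
        "shared": fa & gtf,
        "gtf_only": gtf - fa,
        "fasta_only": fa - gtf,
    }
```

Both functions accept any iterable of lines, so open file handles can be passed directly, e.g. `compare_names(open(fasta_path), open(gtf_path))`. If `gtf_only` lists names such as chr1 while `fasta_only` lists 1, the two files use different naming conventions.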

Janet

Apr 18, 2017, 10:03:18 AM
to rna-star
Hi Dario,

Yes, you are right. I checked the log file and the GENCODE GTF file today and found that it is a chromosome naming format problem. I have to either rename the chromosomes or just use the Ensembl GTF file. Thank you!
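For the renaming option, a rough sketch of converting the first GTF column from GENCODE-style to Ensembl-style names could look like the following. The helper names are hypothetical, and the mapping rules (strip the "chr" prefix, map chrM to MT, leave everything else unchanged) are assumptions that should be checked against the names actually present in the FASTA being used.

```python
def to_ensembl_name(name):
    """Map a GENCODE/UCSC-style sequence name to an Ensembl-style one.

    The rules here are assumptions for illustration, not an official mapping.
    """
    if name == "chrM":
        return "MT"  # assumption: the mitochondrial sequence is named MT in the FASTA
    if name.startswith("chr"):
        return name[3:]  # chr1 -> 1, chrX -> X
    return name  # names without the prefix are left untouched


def rename_gtf_line(line):
    """Rewrite only the first column of a GTF record; pass other lines through."""
    if line.startswith("#") or not line.strip():
        return line
    name, sep, rest = line.partition("\t")
    return to_ensembl_name(name) + sep + rest
```

Applying `rename_gtf_line` to every line of the GTF and writing the result to a new file leaves the original annotation untouched, which makes it easy to rerun the comparison of names before regenerating the index.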

Best,
Janet

Alexander Dobin

Apr 18, 2017, 12:42:42 PM
to rna-star
Good suggestion from Dario!
You could use the FASTA and GTF directly from GENCODE - this will ensure that chromosomes are named exactly the same in the FASTA and the GTF.

Cheers
Alex

Janet

Apr 20, 2017, 2:56:14 PM
to rna-star
Hi Alex,

I see. I thought the GENCODE GTF worked well with the Ensembl FASTA. So for human and mouse, do you recommend using both the FASTA and the GTF from GENCODE? Thanks.

Best,
Janet

Alexander Dobin

Apr 20, 2017, 3:20:34 PM
to rna-star
Hi Janet,

Gencode and ENSEMBL files are very similar except for the chromosome naming. So it's best to use them consistently, i.e. GTF and FASTA from either one.
I would recommend Gencode over ENSEMBL, as the former is compatible with the UCSC genome browser.

Cheers
Alex

Janet

Apr 20, 2017, 4:05:27 PM
to rna-star
Thanks a lot for your replies, Alex. They're really helpful!

Best,
Janet