NCBI's GFF file: how to get gene counts with STAR

2,317 views
Skip to first unread message

Alexey Kozlenkov

unread,
Jan 11, 2017, 9:33:47 AM1/11/17
to rna-star
Hello Alex,

I'm trying to use an NCBI-generated GFF file to do RNA-seq mapping with STAR.
The GFF was downloaded from here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/515/GCF_000001515.7_Pan_tro_3.0/GCF_000001515.7_Pan_tro_3.0_genomic.gff.gz
and the first few lines are as follows:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Pan_tro 3.0
#!genome-build-accession NCBI_Assembly:GCF_000001515.7
##sequence-region NC_006468.4 1 228573443
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9598
NC_006468.4    RefSeq    region    1    228573443    .    +    .    ID=id0;Dbxref=taxon:9598;Name=1;chromosome=1;gbkey=Src;genome=chromosome;isolate=Yerkes chimp pedigree #C0471 (Clint);mol_type=genomic DNA;sex=male
NC_006468.4    Gnomon    gene    29601    30855    .    +    .    ID=gene0;Dbxref=GeneID:107974325;Name=LOC107974325;gbkey=Gene;gene=LOC107974325;gene_biotype=lncRNA
NC_006468.4    Gnomon    lnc_RNA    29601    30855    .    +    .    ID=rna0;Parent=gene0;Dbxref=GeneID:107974325,Genbank:XR_001716999.1;Name=XR_001716999.1;gbkey=ncRNA;gene=LOC107974325;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=uncharacterized LOC107974325;transcript_id=XR_001716999.1
NC_006468.4    Gnomon    exon    29601    29875    .    +    .    ID=id1;Parent=rna0;Dbxref=GeneID:107974325,Genbank:XR_001716999.1;gbkey=ncRNA;gene=LOC107974325;product=uncharacterized LOC107974325;transcript_id=XR_001716999.1

When I just use this file as is, generating a genome index with an option:
--sjdbGTFtagExonParentTranscript Parent \
and then proceed to the mapping step, I don't get any reads assigned to genes or transcripts in my "ReadsPerGene.out.tab" output file:

N_unmapped      2925013 2925013 2925013
N_multimapping  2346918 2346918 2346918
N_noFeature     26462311        26926355        35319350
N_ambiguous     0       0       0


I tried to convert the GFF file to GTF, using Cufflinks' gffread:
gffread -T GCF_000001515.7_Pan_tro_3.0_genomic.gff -o GCF_000001515.7_Pan_tro_3.0_genomic.gtf

However, the resulting GTF doesn't include any gene names:
NC_001643.1    RefSeq    exon    1    71    .    +    .    transcript_id "rna99926"; gbkey "tRNA"; product "tRNA-Phe"; gbkey "tRNA"; product "tRNA-Phe";
NC_001643.1    RefSeq    exon    72    1020    .    +    .    transcript_id "rna99927"; gbkey "rRNA"; product "12S ribosomal RNA"; gbkey "rRNA"; product "12S ribosomal RNA";
NC_001643.1    RefSeq    exon    1021    1089    .    +    .    transcript_id "rna99928"; gbkey "tRNA"; product "tRNA-Val"; gbkey "tRNA"; product "tRNA-Val";

What I'm trying to achieve is to obtain gene counts with --quantMode GeneCounts after mapping.

Do you have any suggestions for the optimal way to proceed here?
Shall I try some other options with the "gffread" command, or is there a better way in STAR with "--sjdbGTFtagExonParentTranscript" or --sjdbGTFtagExonParentGene" options?

Thank you!
Alexey

Alexander Dobin

unread,
Jan 12, 2017, 4:22:19 PM1/12/17
to rna-star
Hi Alexey,

this gff file seems to have a messed up hierarchical structure.The correct structure should be
exon (col3) with Parent pointing to transcript
transcript (col3) with  Parent=gene
gene (col3)

The GFF file has a lot of different entries in the col3:
1214319 exon
1011662 CDS
  79947 mRNA
  39699 gene
  11950 lnc_RNA
   6312 transcript
   1095 match
    915 cDNA_match
    651 miRNA
    642 primary_transcript
    445 tRNA
    189 V_gene_segment
     22 C_gene_segment
      2 rRNA
      1 antisense_RNA
      1 D_loop

The exon -> transcript -> gene hierarchy is not obvious here, this is what likely confuses gffread.

I would suggest that you parse this file yourself and create the GTF file.
You can start with the exon lines and treat their parent as transcripts - add "transcript_id" attribute to them.
Then you can find the these Parent lines and treat their Parents as genes, and add the "gene_id" tags to the exon lines.
This will give a gtf file with exon lines that each have "transcript_id" and "gene_id" tags - this is all STAR needs.

You have to be careful and check that you are not missing anything. For instance, I would check that CDS lines are always contained within the exon lines.
Sometimes the CDS lines are not replicated as exons, and so you may miss the important coding exons.
Also, I would check that the Parent relationships are reasonable.

Cheers
Alex

Alexey Kozlenkov

unread,
Jan 12, 2017, 4:47:19 PM1/12/17
to rna-star
Hi Alex,

Thank you, very helpful !

Can you please clarify what you mean by "CDS lines contained within the exon lines"?
Do they always have to come in pairs?  (a CDS entry and "exon" entry?)

If I only have a CDS entry in my GFF file, can I just create an exon entry based on it?

Best,
Alexey

Alexander Dobin

unread,
Jan 12, 2017, 5:06:35 PM1/12/17
to rna-star
Hi Alexey,

with GTF strict specifications (which GFF do not always adhere to), each "CDS" has to be contained in an "exon", even if they coincide.
Reverse is not true for non-coding exons, i.e. there may be "exon" lines without a "CDS" line.

It's hard to tell if orphan CDS lines should be included as "exon" lines, since the "exon" lines determine the junction gaps, so you may create spurious junctions.
However, not including them will result in missing the annotated junctions and portions of genes.
I think in most cases it's safer to include them.

Cheers
Alex

Alexey Kozlenkov

unread,
Jan 12, 2017, 9:10:04 PM1/12/17
to rna-star
Hi Alex,

Upon closer examination, the GTF file generated from GFF was in fact not too bad... The exception is a few lines in the beginning, where for some genes there were no "exon" entries in the GFF file already. (Those few first lines is what I noticed first, and decided that "ggfread" did not work. I apologize for not checking it earlier.)

Interestingly, these "failed" lines at the top of the file seem to come from the mitochondrial genome.
Example of such gene in the GFF file:
grep "=ND1;" GCF_000001515.7_Pan_tro_3.0_genomic.gff

NC_001643.1    RefSeq    gene    2725    3681    .    +    .    ID=gene39686;Dbxref=GeneID:807867;Name=ND1;gbkey=Gene;gene=ND1;gene_biotype=protein_coding;partial=true;start_range=.,2725
NC_001643.1    RefSeq    CDS    2725    3681    .    +    0    ID=cds80154;Parent=gene39686;Dbxref=Genbank:NP_008186.1,GeneID:807867;Name=NP_008186.1;gbkey=CDS;gene=ND1;partial=true;product=NADH dehydrogenase subunit 1;protein_id=NP_008186.1;start_range=.,2725;transl_table=2

As you can see, there is no "exon" entry for the gene.
Here the NC_001643.1 is in fact the NCBI notation for chrM. (was not obvious!). Apparently, the entries for several mitochondrial genes are not called "exons" in this GFF, which sort of makes sense, I guess. And that's how they end up not having a correct "exon" entry in the GTF file, though they have a CDS entry.
Apart from these cases, the rest of the GTF file seems legit. I will double-check and might try to correct the missing entries, as you suggested.

In the meantime, I've also generated the index with this GTF file, using the option: 
--sjdbGTFtagExonParentGene gene_name

and then did the mapping, again including the same option.

The result now seems correct, with many counts annotated to genes as it should be.

So, I guess, this can be considered "solved"!
Thanks again for your help,

Best,
Alexey

陈泽宇

unread,
Jan 13, 2017, 12:46:10 PM1/13/17
to rna-star
Hi,Alex
it seems that i also have this problem
I use this gff3 on NCBI.I just use mitochondria and chloroplast annoations,just NC_011033.1 and NC_001320.1.
This is my index:



This is my command:

/public/home/zychen/tools/star/STAR-2.5.2b/bin/Linux_x86_64/STAR   --runMode genomeGenerate   --runThreadN 10   --genomeDir /public/home/zychen/staralignments/genomeDir/   --genomeFastaFiles /public/home/zychen/rna-seq/filterfq2085/rice_MC1314_genomes_v7.fasta      --sjdbGTFfile /public/home/zychen/rna-seq/filterfq2085/chrMC1314.gff3   --sjdbGTFtagExonParentTranscript Parent


This is chrNC1314,gff3 photo:



what my problem is my index file sjdbList.out.tab:

just have chr14 information:

[zychen@login genomeDir]$ cat sjdbList.out.tab

chr14 1398 3901 -

chr14 4636 5513 -

chr14 32886 33713 +

chr14 42007 42738 -

chr14 42969 43713 -

chr14 46593 47132 +

chr14 71238 72048 +

chr14 72891 73635 +

chr14 78131 79189 -

chr14 86151 86862 -

chr14 88502 89041 -

chr14 93137 94083 +

chr14 94221 95032 +

chr14 111170 112156 -

chr14 120086 120897 -

chr14 121035 121981 -

chr14 126077 126616 +

chr14 128256 128967 +

chr14 132845 133507 +

########################################
I couldn't find any way to solve my problem,give my advice ,please!Thank you !

Alexander Dobin

unread,
Jan 13, 2017, 5:00:58 PM1/13/17
to rna-star
Hi Alexey,

thanks for sharing the solution.
Did gffread generate the "gene_name" tags instead the "gene_id". I thought the latter was the standard for GTF files.
Or did it add both tags? I think it's safer to use the default --sjdbGTFtagExonParentGene gene_id since, in principle, the gene_name tags are not guaranteed to be unique.

Cheers
Alex


On Thursday, January 12, 2017 at 9:10:04 PM UTC-5, Alexey Kozlenkov wrote:
Hi Alex,

Upon closer examination, the GTF file generated from GFF was in fact not too bad... The exception is a few lines in the beginning, where for some genes there were no "exon" entries in the GFF file already. (Those few first lines is what I noticed first, and decided that "ggfread" did not work. I apologize for not checking it earlier.)

Interestingly, these "failed" lines at the top of the file seem to come from the mitochondrial genome.
Example of such gene in the GFF file:
grep "=ND1;" GCF_000001515.7_Pan_tro_3.0_genomic.gff

NC_001643.1    RefSeq    gene    2725    3681    .    +    .    ID=gene39686;Dbxref=GeneID:807867;Name=ND1;gbkey=Gene;gene=ND1;gene_biotype=protein_coding;partial=true;start_range=.,2725
NC_001643.1    RefSeq    CDS    2725    3681    .    +    0    ID=cds80154;Parent=gene39686;Dbxref=Genbank:NP_008186.1,GeneID:807867;Name=NP_008186.1;gbkey=CDS;gene=ND1;partial=true;product=NADH dehydrogenase subunit 1;protein_id=NP_008186.1;start_range=.,2725;transl_table=2

As you can see, there is no "exon" entry for the gene.
Here the NC_001643.1 is in fact the NCBI notation for chrM. (was not obvious!). Apparently, the entries for several mitochondrial genes are not called "exons" in this GFF, which sort of makes sense, I guess. And that's how they end up not having a correct "exon" entry in the GTF file, though they have a CDS entry.
Apart from these cases, the rest of the GTF file seems legit. I will double-check and might try to correct the missing entries, as you suggested.

In the meantime, I've also generated the index with this GTF file, using the option: 
c

Alexander Dobin

unread,
Jan 13, 2017, 5:06:11 PM1/13/17
to rna-star
Hi,

how did you convert the GFF file at your link (which has references name NC_*) into the chrMC1314.gff3 that you fed to STAR?
You need to be careful to convert the names so that they agree with the chr names in the FASTA file.
If this does not help, please send me the Log.out file.

Cheers
Alex

Alexey Kozlenkov

unread,
Jan 14, 2017, 12:44:49 PM1/14/17
to rna-star
Hi Alex,

For most entries, gffread added both "gene_id" and "gene_name" tags.

However, the "gene_id" was very uninformative, basically the "gene_id" names were like "gene0", "gene1".

An example of an exon of the gene "VWA1":
In the GFF file:
grep "VWA1" GCF_000001515.7_Pan_tro_3.0_genomic.gff | grep "630453"

NC_006468.4    Gnomon    gene    630453    637801    .    +    .    ID=gene47;Dbxref=GeneID:745278;Name=VWA1;gbkey=Gene;gene=VWA1;gene_biotype=protein_coding
NC_006468.4    Gnomon    mRNA    630453    637801    .    +    .    ID=rna115;Parent=gene47;Dbxref=GeneID:745278,Genbank:XM_016952138.1;Name=XM_016952138.1;gbkey=mRNA;gene=VWA1;model_evidence=Supporting evidence includes similarity to: 6 mRNAs%2C 751 ESTs%2C 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 55 samples with support for all annotated introns;product=von Willebrand factor A domain containing 1%2C transcript variant X1;transcript_id=XM_016952138.1
NC_006468.4    Gnomon    exon    630453    630746    .    +    .    ID=id1048;Parent=rna115;Dbxref=GeneID:745278,Genbank:XM_016952138.1;gbkey=mRNA;gene=VWA1;product=von Willebrand factor A domain containing 1%2C transcript variant X1;transcript_id=XM_016952138.1
NC_006468.4    Gnomon    mRNA    630453    637801    .    +    .    ID=rna116;Parent=gene47;Dbxref=GeneID:745278,Genbank:XM_016952145.1;Name=XM_016952145.1;gbkey=mRNA;gene=VWA1;model_evidence=Supporting evidence includes similarity to: 2 mRNAs%2C 442 ESTs%2C 1 Protein%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments;product=von Willebrand factor A domain containing 1%2C transcript variant X2;transcript_id=XM_016952145.1
NC_006468.4    Gnomon    exon    630453    630746    .    +    .    ID=id1051;Parent=rna116;Dbxref=GeneID:745278,Genbank:XM_016952145.1;gbkey=mRNA;gene=VWA1;product=von Willebrand factor A domain containing 1%2C transcript variant X2;transcript_id=XM_016952145.1


in the resulting GTF file:
grep "VWA1" GCF_000001515.7_Pan_tro_3.0_genomic.gtf | grep "630453"

NC_006468.4    Gnomon    exon    630453    630746    .    +    .    transcript_id "rna115"; gene_id "gene47"; gene_name "VWA1";
NC_006468.4    Gnomon    exon    630453    630746    .    +    .    transcript_id "rna116"; gene_id "gene47"; gene_name "VWA1";

That's why I had to use the option "--sjdbGTFtagExonParentGene gene_name" instead of "--sjdbGTFtagExonParentGene gene_id" during both indexing and mapping. Otherwise I end up with gene counts associated to gene names like "gene0", "gene1".
By the way, I did not change the default settings for "--sjdbGTFtagExonParentTranscript"

Best,
Alexey

Alexander Dobin

unread,
Jan 18, 2017, 3:18:39 PM1/18/17
to rna-star
Hi Alexey,

using --sjdbGTFtagExonParentGene gene_name is fine as long as you are sure that every exon has "gene_name" tag, and this tag is unique to each gene (i.e. different genes have different "gene_name"s). GTF specs guarantee rhese properties only for the "gene_id" tag.

STAR defaults --sjdbGTFtagExonParentTranscript to "transcript_id" which is guaranteed to be present and unique in GTF.

Cheers
Alex

usurp

unread,
Apr 9, 2018, 10:36:53 AM4/9/18
to rna-star
Hi Alexy

I also ran into the same problem when using NCBI's gff3 file. The problem seems to be because of the mitochondrial annotations. 

Did you then ignore all the mitochondrial genes altogether? I was thinking of modifying the annotations a bit to convert the mitochondrial genes such that they are recognized as exons by STAR. 

Alexander Dobin

unread,
Apr 9, 2018, 5:08:08 PM4/9/18
to rna-star
Hi @usurp,

mitochondrion genes are not spliced, so STAR does not need to extract the annotated junctions from them.
You also would need the annotations if you are using --quantMode options to convert the alignment to transcriptomic or count reads per gene, if you are interested in mitochondrion genes.Otherwise is safe to drop them altogether.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages