Large number of "no gene_id for line" warnings - GFF3 file formatting problem?

Grant de Jong

Sep 19, 2017, 4:58:12 PM

to rna-star

Hi Alex,

I'm having some difficulty generating a genome index with my GFF3 annotation file (an excerpt of which is below).

chrC03 GazeA2 mRNA 28541218 28543845 572.4227 + . ID=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001

chrC03 GazeA2 UTR 28543523 28543845 6.0158 + . Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001

chrC03 GazeA2 CDS 28543454 28543522 29.9339 + 0 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001

chrC03 GazeA2 CDS 28543158 28543369 27.5481 + 1 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001

chrC03 GazeA2 CDS 28542958 28543060 27.3743 + 0 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001

I was wondering if you could provide some input on the following parameters:

/mnt/bay1/dejonggr/STAR/STAR --runMode genomeGenerate --genomeDir $1 --genomeFastaFiles $genfas --sjdbOverhang 99 --sjdbGTFfile $gtfdir --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentGene ID

The STAR run finishes successfully but I receive a large number of "no gene_id for line" warnings.

Do you have any idea what the problem might be?

Thanks in advance!

Grant

Log.out.trimmed

Grant de Jong

Sep 19, 2017, 5:56:08 PM

to rna-...@googlegroups.com

I downloaded the genome from a different source (Ensembl) and I think the following command managed to fix the problem:

/mnt/bay1/dejonggr/STAR/STAR --runMode genomeGenerate \
--genomeDir $1 \
--genomeFastaFiles $genfas \
--sjdbOverhang 99 \
--sjdbGTFfile $gtfdir \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFtagExonParentGene ID

However, I was wondering how the sjdbGTFtag options relate to each of the attributes. Since the attribute column for the sub-features of a gene (e.g. exons) starts with "Parent=transcript:", I'm assuming I should use Parent as the transcript tag. But I've also noticed that "ID=gene:" is used instead of gene_id for genes, and "ID=transcript:" for the mRNA.

Should I still use "ID" for the ExonParentGene tag, or should I use "ID=gene"?

I'm sorry if this is a basic question. I truly wish there were some consistency among gtf/gff3 files.

Alexander Dobin

Sep 20, 2017, 8:53:37 AM

to rna-star

Hi Grant,

it's best to convert this GFF file into GTF. For instance, you can use

$ gffread -T In.gff3 -o Out.gtf

Please check that in the resulting file the "CDS" lines (which you are using as features instead of "exons") have "gene_id" attributes.

Note that with --sjdbGTFfeatureExon CDS, STAR will only consider CDS features and not UTRs. I think it's generally better to use the union of UTRs and CDSs, which are usually just called "exons".
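The check on the converted file can be sketched in a few lines of Python. This is only a minimal sketch, assuming a tab-separated GTF with attributes written as key "value"; pairs. The function name lines_missing_attribute and the two sample lines are made up for illustration; they are not real gffread output.

```python
import re


def lines_missing_attribute(gtf_lines, feature="CDS", tag="gene_id"):
    """Return (line_number, line) pairs for `feature` rows that lack `tag`."""
    missing = []
    for number, line in enumerate(gtf_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        if len(columns) < 9 or columns[2] != feature:
            continue  # not a complete row of the feature type we care about
        # GTF attributes look like: key "value"; key "value";
        keys = re.findall(r'(\w+)\s+"[^"]*"', columns[8])
        if tag not in keys:
            missing.append((number, line))
    return missing


# Hand-made sample lines (hypothetical, for illustration only).
sample = [
    'chrC03\tGazeA2\tCDS\t28543454\t28543522\t.\t+\t0\t'
    'transcript_id "BnaC03g43490D"; gene_id "BnaC03g43490D";',
    'chrC03\tGazeA2\tCDS\t28543158\t28543369\t.\t+\t1\t'
    'transcript_id "BnaC03g43490D";',
]
for number, line in lines_missing_attribute(sample):
    print(f"line {number} has no gene_id: {line}")
```

For a real file, pass an open file handle instead of the list, for example lines_missing_attribute(open("Out.gtf")). An empty result suggests every CDS line carries a gene_id attribute.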

Cheers

Alex

Grant de Jong

Sep 20, 2017, 2:43:32 PM

to rna-star

The new file I'm using has CDS, exon, and UTR features within most genes. I assumed the default was "exon" so I didn't include anything else. Should I specify "exon"?

Also, I've already made an index from a converted GTF; I'll just use that if you think it will work better.

Thanks,

Grant

Alexander Dobin

Sep 22, 2017, 12:42:06 PM

to rna-star

Hi Grant,

the "exon" feature is indeed default.
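Before relying on the default, it may help to confirm which feature types the annotation actually contains. The following is a rough sketch that tallies the third column of a tab-separated GFF3 or GTF file; feature_counts is a hypothetical helper name, not part of STAR.

```python
from collections import Counter


def feature_counts(annotation_lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts
```

If "exon" appears in the result alongside CDS and UTR, the default exon feature should have something to work with.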

Whether conversion to GTF works better than using the --sjdbGTFtagExonParentTranscript Parent --sjdbGTFtagExonParentGene ID options depends on the file formatting.

Strictly speaking, the GFF file is only required to follow the Parent hierarchy: an exon's Parent is a transcript, and a transcript's Parent is a gene. The GFF-to-GTF conversion uses just that information to add transcript_id and gene_id tags to exons.
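The Parent hierarchy described above can be illustrated with a small Python sketch. It assumes tab-separated GFF3 with key=value attributes and a single Parent per feature; the function names are hypothetical and this is not how gffread or STAR is actually implemented. Note that the tag is the attribute key before the "=" (for example ID or Parent), while text such as "gene:" belongs to the value.

```python
def parse_gff3_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def exon_parents(gff3_lines, exon_feature="exon"):
    """Resolve each exon-like row to (transcript_id, gene_id) via Parent links."""
    parent_of = {}  # feature ID -> ID of its Parent (None for top-level features)
    exon_transcripts = []  # Parent value of every exon-like row, in file order
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_gff3_attributes(columns[8])
        if "ID" in attrs:
            parent_of[attrs["ID"]] = attrs.get("Parent")
        if columns[2] == exon_feature and "Parent" in attrs:
            exon_transcripts.append(attrs["Parent"])
    resolved = []
    for transcript_id in exon_transcripts:
        gene_id = parent_of.get(transcript_id)
        # Assumption of this sketch: if the transcript has no Parent,
        # reuse the transcript ID as the gene ID.
        resolved.append((transcript_id, gene_id or transcript_id))
    return resolved
```

With a file where the mRNA rows have no Parent, as in the first excerpt in this thread, calling exon_parents(lines, exon_feature="CDS") would fall back to the transcript ID for the gene under the stated assumption.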

The fact that "ID" in your GFF file means gene ID is a courtesy from its creators. :)