Hi Alex,
I'm having some difficulty generating a genome index with my GFF3 annotation file (a sample of the format is below).
chrC03 GazeA2 mRNA 28541218 28543845 572.4227 + . ID=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001
chrC03 GazeA2 UTR 28543523 28543845 6.0158 + . Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001
chrC03 GazeA2 CDS 28543454 28543522 29.9339 + 0 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001
chrC03 GazeA2 CDS 28543158 28543369 27.5481 + 1 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001
chrC03 GazeA2 CDS 28542958 28543060 27.3743 + 0 Parent=BnaC03g43490D;Name=BnaC03g43490D;Alias=GSBRNA2T00158351001
I was wondering if you could provide some input on the following parameters:
/mnt/bay1/dejonggr/STAR/STAR --runMode genomeGenerate \
--genomeDir $1 \
--genomeFastaFiles $genfas \
--sjdbOverhang 99 \
--sjdbGTFfile $gtfdir \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFfeatureExon CDS \
--sjdbGTFtagExonParentGene ID
The STAR run finishes successfully, but I receive a large number of "no gene_id for line" warnings.
Do you have any idea what the problem might be?
Thanks in advance!
Grant
I downloaded the genome from a different source (Ensembl) and I think the following command managed to fix the problem:
/mnt/bay1/dejonggr/STAR/STAR --runMode genomeGenerate \
--genomeDir $1 \
--genomeFastaFiles $genfas \
--sjdbOverhang 99 \
--sjdbGTFfile $gtfdir \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFtagExonParentGene ID
However, I was wondering how the sjdbGTFtag options relate to each of the attributes. Since the attribute column for gene sub-features (e.g. exons) starts with "Parent=transcript:", I'm assuming I should use Parent as the transcript tag. But I've also noticed that "ID=gene:" is used instead of gene_id, followed by "ID=transcript:" for the mRNA.
Should I still use "ID" for the ExonParentGene tag, or should I use "ID=gene"?
I'm sorry if this is a basic question. I truly wish there were some consistency among GTF/GFF3 files.
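
In case it helps anyone looking at the same thing, here is a rough sketch for listing which attribute keys actually occur on each feature type in a GFF3 file. It assumes the file is tab-delimited with the attributes in column 9, and that the sjdbGTFtag options expect the attribute key name (the part before "="). The function name gff3_attr_keys and the file name annotation.gff3 are made up for illustration.

```shell
# Hypothetical helper: print "feature_type<TAB>attribute_key" pairs, de-duplicated.
# Assumes a tab-delimited GFF3 file with key=value attributes in column 9.
gff3_attr_keys() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }            # skip comment lines and short lines
        {
            n = split($9, attrs, ";")      # attributes are separated by ";"
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")   # kv[1] is the key, kv[2] the value
                if (kv[1] != "") print $3 "\t" kv[1]
            }
        }
    ' "$1" | sort -u
}

# Example call (annotation.gff3 is a placeholder path):
# gff3_attr_keys annotation.gff3
```

If the exon or CDS rows list only Parent (and no ID), then a gene tag of ID would presumably not be found on those lines, which may be related to the warnings above. I have not confirmed that against the STAR source.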