##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Pan_tro 3.0
#!genome-build-accession NCBI_Assembly:GCF_000001515.7
##sequence-region NC_006468.4 1 228573443
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9598
NC_006468.4 RefSeq region 1 228573443 . + . ID=id0;Dbxref=taxon:9598;Name=1;chromosome=1;gbkey=Src;genome=chromosome;isolate=Yerkes chimp pedigree #C0471 (Clint);mol_type=genomic DNA;sex=male
NC_006468.4 Gnomon gene 29601 30855 . + . ID=gene0;Dbxref=GeneID:107974325;Name=LOC107974325;gbkey=Gene;gene=LOC107974325;gene_biotype=lncRNA
NC_006468.4 Gnomon lnc_RNA 29601 30855 . + . ID=rna0;Parent=gene0;Dbxref=GeneID:107974325,Genbank:XR_001716999.1;Name=XR_001716999.1;gbkey=ncRNA;gene=LOC107974325;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=uncharacterized LOC107974325;transcript_id=XR_001716999.1
NC_006468.4 Gnomon exon 29601 29875 . + . ID=id1;Parent=rna0;Dbxref=GeneID:107974325,Genbank:XR_001716999.1;gbkey=ncRNA;gene=LOC107974325;product=uncharacterized LOC107974325;transcript_id=XR_001716999.1
N_unmapped 2925013 2925013 2925013
N_multimapping 2346918 2346918 2346918
N_noFeature 26462311 26926355 35319350
N_ambiguous 0 0 0
gffread -T GCF_000001515.7_Pan_tro_3.0_genomic.gff -o GCF_000001515.7_Pan_tro_3.0_genomic.gtf
NC_001643.1 RefSeq exon 1 71 . + . transcript_id "rna99926"; gbkey "tRNA"; product "tRNA-Phe"; gbkey "tRNA"; product "tRNA-Phe";
NC_001643.1 RefSeq exon 72 1020 . + . transcript_id "rna99927"; gbkey "rRNA"; product "12S ribosomal RNA"; gbkey "rRNA"; product "12S ribosomal RNA";
NC_001643.1 RefSeq exon 1021 1089 . + . transcript_id "rna99928"; gbkey "tRNA"; product "tRNA-Val"; gbkey "tRNA"; product "tRNA-Val";
--sjdbGTFtagExonParentGene gene_name
This is my command:
/public/home/zychen/tools/star/STAR-2.5.2b/bin/Linux_x86_64/STAR --runMode genomeGenerate --runThreadN 10 --genomeDir /public/home/zychen/staralignments/genomeDir/ --genomeFastaFiles /public/home/zychen/rna-seq/filterfq2085/rice_MC1314_genomes_v7.fasta --sjdbGTFfile /public/home/zychen/rna-seq/filterfq2085/chrMC1314.gff3 --sjdbGTFtagExonParentTranscript Parent
This is a screenshot of chrMC1314.gff3:
My problem is that the index file sjdbList.out.tab contains only chr14 information:
[zychen@login genomeDir]$ cat sjdbList.out.tab
chr14 1398 3901 -
chr14 4636 5513 -
chr14 32886 33713 +
chr14 42007 42738 -
chr14 42969 43713 -
chr14 46593 47132 +
chr14 71238 72048 +
chr14 72891 73635 +
chr14 78131 79189 -
chr14 86151 86862 -
chr14 88502 89041 -
chr14 93137 94083 +
chr14 94221 95032 +
chr14 111170 112156 -
chr14 120086 120897 -
chr14 121035 121981 -
chr14 126077 126616 +
chr14 128256 128967 +
chr14 132845 133507 +
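One way to narrow this down is to count, per sequence name, how many "exon" lines in the annotation carry the attribute passed to --sjdbGTFtagExonParentTranscript (here, Parent). The following is a minimal Python sketch, assuming a tab-separated GFF3 file; the function name exon_counts_by_seqid and the sample lines are made up for illustration, and the code does not reproduce STAR's own parsing.

```python
from collections import Counter


def exon_counts_by_seqid(lines, feature="exon", parent_tag="Parent"):
    """Count, per sequence name, the feature lines that carry parent_tag."""
    counts = Counter()
    for line in lines:
        # Skip blank lines, pragmas and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        keys = {
            field.split("=", 1)[0].strip()
            for field in cols[8].split(";")
            if "=" in field
        }
        if parent_tag in keys:
            counts[cols[0]] += 1
    return counts


# Hypothetical sample: chr13 has only mRNA/CDS lines, chr14 has exon lines.
sample = [
    "##gff-version 3",
    "chr13\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=t1",
    "chr13\tsrc\tCDS\t100\t400\t.\t+\t0\tID=c1;Parent=t1",
    "chr14\tsrc\texon\t100\t400\t.\t-\t.\tID=e1;Parent=t2",
    "chr14\tsrc\texon\t500\t900\t.\t-\t.\tID=e2;Parent=t2",
]
counts = exon_counts_by_seqid(sample)
print(dict(counts))
```

A sequence that is absent from the result has no exon lines with a Parent attribute, which would be one possible explanation for it contributing nothing to sjdbList.out.tab. To run it on a real file, pass an open file handle instead of the sample list.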
Hi Alex,
Upon closer examination, the GTF file generated from the GFF was in fact not too bad... The exception is a few lines at the beginning, where some genes already had no "exon" entries in the GFF file. (Those first few lines are what I noticed first, and I decided that "gffread" did not work. I apologize for not checking it earlier.)
Interestingly, these "failed" lines at the top of the file seem to come from the mitochondrial genome.
Example of such gene in the GFF file:
grep "=ND1;" GCF_000001515.7_Pan_tro_3.0_genomic.gff
NC_001643.1 RefSeq gene 2725 3681 . + . ID=gene39686;Dbxref=GeneID:807867;Name=ND1;gbkey=Gene;gene=ND1;gene_biotype=protein_coding;partial=true;start_range=.,2725
NC_001643.1 RefSeq CDS 2725 3681 . + 0 ID=cds80154;Parent=gene39686;Dbxref=Genbank:NP_008186.1,GeneID:807867;Name=NP_008186.1;gbkey=CDS;gene=ND1;partial=true;product=NADH dehydrogenase subunit 1;protein_id=NP_008186.1;start_range=.,2725;transl_table=2
As you can see, there is no "exon" entry for the gene.
Here, NC_001643.1 is in fact the NCBI notation for chrM (this was not obvious!). Apparently, the entries for several mitochondrial genes are not called "exons" in this GFF, which sort of makes sense, I guess. That is how they end up without a correct "exon" entry in the GTF file, even though they have a CDS entry.
Apart from these cases, the rest of the GTF file seems legit. I will double-check and might try to correct the missing entries, as you suggested.
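For the missing entries, one possible fix is to add an "exon" line to the GFF for every CDS whose parent has no exon at all, before converting to GTF. The sketch below is only an illustration under that assumption: the function names parse_attrs and add_missing_exons and the "exon_" ID prefix are invented here, the sample lines are abridged from the ND1 example above, and I have not verified how gffread or STAR handle an exon whose Parent is a gene rather than a transcript.

```python
def parse_attrs(col9):
    """Split a GFF3 attribute column into a dict of key -> raw value."""
    attrs = {}
    for field in col9.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value
    return attrs


def add_missing_exons(lines):
    """Return the lines, with an exon added after each CDS lacking one."""
    rows = [line.rstrip("\n") for line in lines]

    # First pass: collect every parent ID that already has an exon child.
    parents_with_exon = set()
    for row in rows:
        cols = row.split("\t")
        if len(cols) == 9 and cols[2] == "exon":
            parent = parse_attrs(cols[8]).get("Parent")
            if parent:
                parents_with_exon.update(parent.split(","))

    # Second pass: copy each line; after a CDS whose parent has no exon,
    # emit an exon with the same coordinates and strand.
    out = []
    for row in rows:
        out.append(row)
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attrs(cols[8])
        parent = attrs.get("Parent")
        if parent is None:
            continue
        if any(p in parents_with_exon for p in parent.split(",")):
            continue
        exon = list(cols)
        exon[2] = "exon"
        exon[7] = "."  # phase is only defined for CDS features
        # Assumed ID scheme; a CDS split over several lines with one shared
        # ID would need a counter here to keep the exon IDs unique.
        exon[8] = "ID=exon_{};Parent={}".format(attrs.get("ID", "unknown"), parent)
        out.append("\t".join(exon))
    return out


# Abridged from the ND1 entries shown above.
sample = [
    "NC_001643.1\tRefSeq\tgene\t2725\t3681\t.\t+\t.\tID=gene39686;Name=ND1",
    "NC_001643.1\tRefSeq\tCDS\t2725\t3681\t.\t+\t0\tID=cds80154;Parent=gene39686",
]
fixed = add_missing_exons(sample)
print("\n".join(fixed))
```

Running the function a second time on its own output adds nothing, because the parent then already has an exon child.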
In the meantime, I've also generated the index with this GTF file, using the option:
--sjdbGTFtagExonParentGene gene_name
grep "VWA1" GCF_000001515.7_Pan_tro_3.0_genomic.gff | grep "630453"
NC_006468.4 Gnomon gene 630453 637801 . + . ID=gene47;Dbxref=GeneID:745278;Name=VWA1;gbkey=Gene;gene=VWA1;gene_biotype=protein_coding
NC_006468.4 Gnomon mRNA 630453 637801 . + . ID=rna115;Parent=gene47;Dbxref=GeneID:745278,Genbank:XM_016952138.1;Name=XM_016952138.1;gbkey=mRNA;gene=VWA1;model_evidence=Supporting evidence includes similarity to: 6 mRNAs%2C 751 ESTs%2C 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 55 samples with support for all annotated introns;product=von Willebrand factor A domain containing 1%2C transcript variant X1;transcript_id=XM_016952138.1
NC_006468.4 Gnomon exon 630453 630746 . + . ID=id1048;Parent=rna115;Dbxref=GeneID:745278,Genbank:XM_016952138.1;gbkey=mRNA;gene=VWA1;product=von Willebrand factor A domain containing 1%2C transcript variant X1;transcript_id=XM_016952138.1
NC_006468.4 Gnomon mRNA 630453 637801 . + . ID=rna116;Parent=gene47;Dbxref=GeneID:745278,Genbank:XM_016952145.1;Name=XM_016952145.1;gbkey=mRNA;gene=VWA1;model_evidence=Supporting evidence includes similarity to: 2 mRNAs%2C 442 ESTs%2C 1 Protein%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments;product=von Willebrand factor A domain containing 1%2C transcript variant X2;transcript_id=XM_016952145.1
NC_006468.4 Gnomon exon 630453 630746 . + . ID=id1051;Parent=rna116;Dbxref=GeneID:745278,Genbank:XM_016952145.1;gbkey=mRNA;gene=VWA1;product=von Willebrand factor A domain containing 1%2C transcript variant X2;transcript_id=XM_016952145.1
grep "VWA1" GCF_000001515.7_Pan_tro_3.0_genomic.gtf | grep "630453"
NC_006468.4 Gnomon exon 630453 630746 . + . transcript_id "rna115"; gene_id "gene47"; gene_name "VWA1";
NC_006468.4 Gnomon exon 630453 630746 . + . transcript_id "rna116"; gene_id "gene47"; gene_name "VWA1";
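To compare the two files beyond a couple of grep hits, the ninth column of each format can be parsed and matched. The following is a rough Python sketch, assuming the conversion maps the GFF3 "Parent" of an exon to the GTF "transcript_id" and the GFF3 "gene" to "gene_name", as the lines above suggest; the function names are my own, and the sample strings are abridged from the VWA1 output.

```python
import re
from urllib.parse import unquote


def parse_gff3_attributes(col9):
    """Parse 'key=value;key=value' and decode escapes such as %2C and %25."""
    attrs = {}
    for field in col9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def parse_gtf_attributes(col9):
    """Parse 'key "value"; key "value";' (a repeated key keeps its last value)."""
    return dict(re.findall(r'(\S+) "([^"]*)";', col9))


# Column 9 of the first VWA1 exon in each file, abridged.
gff_col9 = (
    "ID=id1048;Parent=rna115;gene=VWA1;"
    "product=von Willebrand factor A domain containing 1%2C transcript variant X1"
)
gtf_col9 = 'transcript_id "rna115"; gene_id "gene47"; gene_name "VWA1";'

gff = parse_gff3_attributes(gff_col9)
gtf = parse_gtf_attributes(gtf_col9)

print(gff["product"])
print(gff["Parent"] == gtf["transcript_id"], gff["gene"] == gtf["gene_name"])
```

Applied line by line to the exon records of both files, a check like this would flag exons whose transcript or gene name did not survive the conversion.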