could I build STAR index using NCBI gff annotation

3,918 views
Skip to first unread message

xiangyu...@gmail.com

unread,
Sep 19, 2016, 1:11:58 PM9/19/16
to rna-star

      Hi,
              I wonder that whether STAR can use gff annotation from NCBI to build the reference fasta 's index and also the reference genome FASTA from NCBI.
              And if it can't ,could I transfer the GFF to GTF to satisfy the request of gene.id and transcript.id through my own python script.
             My command line is as follow:
            
STAR --runMode genomeGenerate --genomeDir /stor9000/apps/users/sheep_reference --genomeFastaFiles /stor9000/apps/users/sheep_reference/GCF_000298735.1_Oar_v3.1_genomic.fna --runThreadN 24 --sjdbGTFfile /stor9000/apps/users/sheep_reference/GCF_000298735.1_Oar_v3.1_genomic.gff --sjdbOverhang 99


              Look forward to your reply.
              Best!

              Xiangyu Pan

Alexander Dobin

unread,
Sep 21, 2016, 6:09:46 PM9/21/16
to rna-star
Hi Xiangyu

it is possible to use the gff file, however you would need to specify --sjdbGTFtagExonParentTranscript parameter, which is typically 'Parent' for gff files - this is the attribute that assigns exon to a transcript.
If you want to use --sjdbGTFtagExonParentTranscript GeneCounts option, the gff file would have to contain the gene attrbiute for each exon, specified in --sjdbGTFtagExonParentGene. Since typically this attribute is not present, it's best to convert the file to GTF with gene_id and transcript_id for each exons. You can use your own script, or gffread from Cufflinks package:
gffread -T small.gff3 -o small.gtf

Cheers
Alex

Marianna Pauletto

unread,
Jul 13, 2017, 2:55:36 PM7/13/17
to rna-star
Hi Alex,

I'm trying to do what you suggested but there is a problem with the ReadsPerGene.tab file which contains a list of genes like gene1, gene2, gene3, instead of the gene IDs.

I used gffread to convert the gff file from ncbi to gtf.
Then I run the mapping step doing like this:

STAR --runThreadN 16 --genomeDir /home/mpauletto/elab/GenomeDir/ --readFilesIn /home/mpauletto/elab/data/file.fastq.gz --readFilesCommand gunzip -c --outFileNamePrefix /home/mpauletto/elab/mapping/test_ --twopassMode Basic --outSAMtype None --outSAMmapqUnique 60 --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM GeneCounts --outFilterMismatchNmax 3

What's wrong??

Thank you in advance
Best
Marianna

Alexander Dobin

unread,
Jul 13, 2017, 3:33:18 PM7/13/17
to rna-star
Hi Marianna,

please send me the link to the NCBI GFF file you are using.

Cheers
Alex

Marianna Pauletto

unread,
Jul 28, 2017, 11:01:39 AM7/28/17
to rna-star

Marianna Pauletto

unread,
Jul 28, 2017, 12:34:02 PM7/28/17
to rna-star
Hi Alexander,

I think I figured out:
This is what I did:

First I converted the gff in gtf

then

STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir ./GenomeDir \
--genomeFastaFiles ./GCF_000633615.1_Guppy_female_1.0_MT_genomic.fna \
--sjdbGTFfile ./GCF_000633615.1_Guppy_female_1.0_MT_genomic.gtf \
--sjdbOverhang 49

then

STAR --runThreadN 16 --genomeDir /home/mpauletto/elab/guppy/GenomeDir/ --readFilesIn /home/data/${s}.fastq.gz \
--readFilesCommand gunzip -c --outFileNamePrefix /home/mapping/${s}_ \

--twopassMode Basic --outSAMtype None --outSAMmapqUnique 60 --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM GeneCounts --outFilterMismatchNmax 3

and this is the GeneCounts tab

N_unmapped      841453  841453  841453
N_multimapping  0       0       0
N_noFeature     1675927 16399215        1765998
N_ambiguous     74166   139     4650
gene0   7038    4280    2837
gene0   169     0       169
gene1   0       0       0
gene2   611     0       611
gene3   611     0       611
gene4   1       0       1
....

Sounds pretty good, since I suppose that now I need to extract the gene name and description of each gene from the gff file.
But why I have two gene0? It seems this is because the gff file has two genes labeled as "gene0". Isn't it?


Thank you
Best
Marianna

Alexander Dobin

unread,
Aug 1, 2017, 3:49:03 PM8/1/17
to rna-star
Hi Marianna,

I have run gffread gff-to-gtf conversion on this gff file, and the resulting gtf appears to have a few exon lines that do not have "gene_id" attribute.
This causes problems for STAR - I could reproduce the problem. After I removed these lines (e.g. $ grep gene_id Annot.gtf > Annot.gene_id.gtf), only one gene0 entry remains. Please try it out.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages