STAR indexing with gff format

Aggelos Kozonakis

Feb 6, 2023, 1:25:13 PM
to rna-star
Hello, I recently started my bioinformatics courses and I am doing my first STAR indexing/mapping.
I know the indexing takes a long time to finish, but I wanted to check that I have the parameters set properly. I am indexing the whole human genome.

The GFF file, downloaded from the NCBI rsync server, looks like this:
```
NC_000001.11    RefSeq  region  1       248956422       .       +       .       ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.11    BestRefSeq      pseudogene      11874   14409   .       +       .       ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
NC_000001.11    BestRefSeq      transcript      11874   14409   .       +       .       ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11    BestRefSeq      exon    11874   12227   .       +       .       ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11    BestRefSeq      exon    12613   12721   .       +       .       ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11    BestRefSeq      exon    13221   14409   .       +       .       ID=exon-NR_046018.2-3;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
```
My command for indexing:

STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ${2}.fna --sjdbGTFfile ${2}.gff --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 100 --runThreadN 12              

And for mapping, I am using paired-end reads:

for map in $( cat patient.txt ); do
    echo "Mapping ${map}"
    STAR --genomeDir ./ --readFilesIn ${fastqdir}trimmed_${map}_1.fastq ${fastqdir}trimmed_${map}_2.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${map}
done


My main concern is that I have set the wrong parameters for indexing, since I am not 100% sure about the contents of each column.
Is Parent in the GFF file the column I should be using for the exon parent transcript parameter?
Also, which column here is considered the "gene id"? I am asking since I need to use it for HTSeq and I'm torn between ID / Dbxref / Name / gene.
Sorry for the basic questions, and thanks in advance for your time and help!
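
For context on the two questions above: in GFF3, ID, Parent, Dbxref, Name and gene are not separate columns but key=value attributes packed into column 9. The STAR manual suggests `--sjdbGTFtagExonParentTranscript Parent` for GFF3 annotations, which matches the command shown. A minimal sketch for inspecting what a given attribute holds on the exon rows follows; the function name `gff_exon_attr` is made up for illustration, and it assumes a tab-separated GFF3 file like the excerpt above.

```shell
# Print the value of one attribute key (e.g. Parent or gene) for every exon row
# of a GFF3 file. Column 3 is the feature type; column 9 holds the
# semicolon-separated key=value attributes.
# Hypothetical helper, not part of STAR or HTSeq.
gff_exon_attr() {
    gff="$1"
    key="$2"
    awk -F'\t' -v key="$key" '
        $0 !~ /^#/ && $3 == "exon" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(attrs[i], "=")
                # Compare the text before "=" with the requested key
                if (eq > 0 && substr(attrs[i], 1, eq - 1) == key) {
                    print substr(attrs[i], eq + 1)
                }
            }
        }
    ' "$gff"
}

# Example call (file name is a placeholder):
#   gff_exon_attr annotation.gff gene | sort -u | head
```

On the excerpt above, `Parent` on exon rows points at the transcript (`rna-NR_046018.2`), while `gene` carries the gene symbol (`DDX11L1`). If the exon rows in the full file all carry a `gene` attribute, passing that key to htseq-count through its `-i`/`--idattr` option is one plausible choice, but that is worth checking against the whole file first.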

Aggelos Kozonakis

Feb 6, 2023, 3:01:27 PM
to rna-star
OK, so after a few hours it finished. My results from samtools view look like this:

ERR009159.3214333       355     NC_000001.11    13209   3       34M2S   =       183725  170553  GTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGAG    BBBBBBBA@ABB@AB<:7>@AAB@@@BBBBA><;7<    NH:i:2  HI
ERR009159.1419841       163     NC_000001.11    13482   1       1S37M   =       13529   83      GGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTT  ABBABBBBBB@>>BBAAABBBBBBBB>7@>;@=@>>6A  NH:i:3  HI
ERR009159.1419841       83      NC_000001.11    13529   1       36M     =       13482   -83     GCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAG    ===A@@@6A?@@ABBBBBBBBBBBBBBBB@B@@;@B    NH:i:3  HI
ERR009159.4650843       99      NC_000001.11    14362   0       2S34M   =       14384   59      TTTCCTGCACAGCTAGAGATCCTTTATTAAAAGCAC    DDDDDDDDDCCCDCCCDDDDDDDDCCCCCAA?AACC    NH:i:6  HI

Instead of chromosome names I have NC_00... Is that problematic?
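
STAR reports reference names as they appear in the FASTA headers, so NC_000001.11 shows up here because that is what the NCBI FASTA calls chromosome 1. What matters for the annotation to be usable is that column 1 of the GFF uses the same names as the FASTA. A rough sketch of that check is below; the function name `missing_seqnames` is hypothetical, and it assumes an uncompressed FASTA and a tab-separated annotation file.

```shell
# List sequence names used in column 1 of the annotation that have no matching
# FASTA header. An empty result means the two files agree on naming.
# Hypothetical helper for illustration only.
missing_seqnames() {
    fasta="$1"
    annot="$2"
    fa_names=$(mktemp)
    annot_names=$(mktemp)
    # FASTA headers: keep the first whitespace-delimited word after ">"
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$fa_names"
    # Annotation: column 1 of every non-comment line
    grep -v '^#' "$annot" | cut -f1 | sort -u > "$annot_names"
    # comm -13 keeps only lines unique to the second (annotation) list
    comm -13 "$fa_names" "$annot_names"
    rm -f "$fa_names" "$annot_names"
}

# Example call (file names are placeholders):
#   missing_seqnames genome.fna annotation.gff
```

Downstream tools that expect names such as chr1 or 1 would still need either a matching annotation or a renaming step, which this sketch does not attempt.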

Alexander Dobin

Feb 6, 2023, 4:12:04 PM
to rna-star
Hi,

For the human genome, I recommend using the FASTA and GTF from Gencode or Ensembl. They have conventional chromosome names and more comprehensive annotations.
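
As an illustration of what the index step might look like with a GTF instead of the GFF, here is a small sketch that only assembles and prints the command so it can be reviewed before running. The function name `star_index_cmd` and every file or directory name passed to it are placeholders, not actual Gencode or Ensembl file names. With a GTF, the `--sjdbGTFtagExonParentTranscript Parent` override from the original command should no longer be needed, since STAR's default for that option is `transcript_id`.

```shell
# Assemble a genomeGenerate command for a GTF-based index and print it.
# Arguments: FASTA, GTF, output directory, optional overhang (default 100),
# optional thread count (default 12, as in the original command).
# Hypothetical helper; all paths passed in are placeholders.
star_index_cmd() {
    fasta="$1"
    gtf="$2"
    outdir="$3"
    overhang="${4:-100}"
    threads="${5:-12}"
    printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s --sjdbOverhang %s --runThreadN %s\n' \
        "$outdir" "$fasta" "$gtf" "$overhang" "$threads"
}

# Example call (names are placeholders):
#   star_index_cmd genome.fa annotation.gtf star_index
```

The STAR manual describes the ideal `--sjdbOverhang` as read length minus 1, with 100 as a generally workable default, so the fourth argument can be adjusted for short reads.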