STAR indexing with GFF format

Aggelos Kozonakis

Feb 6, 2023, 1:25:13 PM

to rna-star

Hello, I recently started my bioinformatics courses and I am doing my first STAR indexing/mapping.

I know the indexing takes a long time to finish, but I wanted to check that I have the parameters set properly. I am indexing the whole human genome.

The GFF file, downloaded from the NCBI rsync server, looks like this:

```

NC_000001.11 RefSeq region 1 248956422 . + . ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.11 BestRefSeq pseudogene 11874 14409 . + . ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
NC_000001.11 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 11874 12227 . + . ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 12613 12721 . + . ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 13221 14409 . + . ID=exon-NR_046018.2-3;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2

```

My command for indexing:

STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ${2}.fna --sjdbGTFfile ${2}.gff --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 100 --runThreadN 12
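
One parameter in the command above that can be checked against the data is --sjdbOverhang: the STAR manual describes the ideal value as the maximum read length minus 1, and notes that the default of 100 works well in most cases. A minimal sketch for deriving that value from an uncompressed FASTQ file is shown here; the function names `max_read_length` and `sjdb_overhang` are hypothetical helpers, not part of STAR.

```shell
# Print the longest read length found in an uncompressed FASTQ file.
# Assumes the standard 4-line record layout (sequence on line 2 of each record).
max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max + 0 }' "$1"
}

# Print a candidate --sjdbOverhang value: longest read length minus 1.
sjdb_overhang() {
    len=$(max_read_length "$1")
    echo $(( len - 1 ))
}
```

Calling `sjdb_overhang` on one of the trimmed FASTQ files would give a value to compare against the 100 used above.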

And for mapping, I am using paired-end reads:

for map in $( cat patient.txt ); do
    echo "Mapping ${map}"
    STAR --genomeDir ./ --readFilesIn ${fastqdir}trimmed_${map}_1.fastq ${fastqdir}trimmed_${map}_2.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${map}
done

My main concern is that I have set the wrong parameters for the indexing, since I am not 100% sure about the contents of each column.
Is Parent in the GFF file the column I should be using for the exon parent transcript parameter?
Also, which column here is considered the "gene id"? I am asking because I need it for HTSeq, and I am torn between ID / Dbxref / Name / gene.
Sorry for the basic questions, and thanks in advance for your time and help!
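
For context on the question above: in GFF3, ID, Parent, Dbxref, Name and gene are not separate columns but key=value attributes packed into the ninth column, separated by semicolons. The sketch below pulls one attribute out of tab-separated GFF3 lines, which may help when checking what --sjdbGTFtagExonParentTranscript or the htseq-count --idattr option would see. The function name `gff_attr` is hypothetical, and treating `gene` as a usable ID attribute is an assumption based only on the exon lines shown above.

```shell
# Print the value of one attribute (e.g. Parent, gene) taken from
# column 9 of tab-separated GFF3 lines read on stdin.
gff_attr() {
    awk -F '\t' -v key="$1" '
        !/^#/ && NF >= 9 {
            # Column 9 holds semicolon-separated key=value pairs.
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                eq = index(parts[i], "=")
                if (eq > 0 && substr(parts[i], 1, eq - 1) == key)
                    print substr(parts[i], eq + 1)
            }
        }'
}
```

For example, `gff_attr Parent < annotation.gff | head` (file name hypothetical) would list the Parent values line by line; on the exon lines above these are the transcript IDs such as rna-NR_046018.2.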

Aggelos Kozonakis

Feb 6, 2023, 3:01:27 PM

to rna-star

OK, so after a few hours it finished. My results, viewed with samtools view, look like this:

ERR009159.3214333 355 NC_000001.11 13209 3 34M2S = 183725 170553 GTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGAG BBBBBBBA@ABB@AB<:7>@AAB@@@BBBBA><;7< NH:i:2 HI
ERR009159.1419841 163 NC_000001.11 13482 1 1S37M = 13529 83 GGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTT ABBABBBBBB@>>BBAAABBBBBBBB>7@>;@=@>>6A NH:i:3 HI
ERR009159.1419841 83 NC_000001.11 13529 1 36M = 13482 -83 GCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAG ===A@@@6A?@@ABBBBBBBBBBBBBBBB@B@@;@B NH:i:3 HI
ERR009159.4650843 99 NC_000001.11 14362 0 2S34M = 14384 59 TTTCCTGCACAGCTAGAGATCCTTTATTAAAAGCAC DDDDDDDDDCCCDCCCDDDDDDDDCCCCCAA?AACC NH:i:6 HI

Instead of chromosome names I have NC_00... identifiers. Is that problematic?

Alexander Dobin

Feb 6, 2023, 4:12:04 PM

to rna-star

Hi,

For the human genome, I recommend using the FASTA and GTF from Gencode or Ensembl. They have conventional chromosome names and more comprehensive annotations.
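
For reference, the NC_... names in the alignments above are RefSeq accessions for the same chromosomes; STAR itself requires that the FASTA and the annotation use matching sequence names. If a downstream tool expects conventional names, the correspondence for the human primary chromosomes can be sketched as below. The function name `refseq_to_chr` and the "chr" prefix style are assumptions; anything that is not a primary chromosome accession (unplaced scaffolds, alternate loci) is passed through unchanged.

```shell
# Map a human RefSeq chromosome accession to a conventional name.
# NC_000001..NC_000022 -> chr1..chr22, NC_000023 -> chrX,
# NC_000024 -> chrY, NC_012920 -> chrM; other input is echoed unchanged.
refseq_to_chr() {
    acc=${1%%.*}                      # drop the version suffix, e.g. ".11"
    case "$acc" in
        NC_012920) echo "chrM" ;;
        NC_000023) echo "chrX" ;;
        NC_000024) echo "chrY" ;;
        NC_0000[0-2][0-9])
            num=${acc#NC_0000}        # e.g. "01", "22"
            num=${num#0}              # strip one leading zero
            if [ "$num" -ge 1 ] && [ "$num" -le 22 ]; then
                echo "chr$num"
            else
                echo "$1"
            fi ;;
        *) echo "$1" ;;
    esac
}
```

Usage: `refseq_to_chr NC_000001.11` prints `chr1`.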
