Hello, I recently started my bioinformatics courses and I am doing my first STAR indexing/mapping.
I know the indexing takes a long time to finish, but I wanted to check that I have the parameters set properly before waiting on it. I am indexing the whole human genome.
The GFF file, downloaded from the NCBI rsync server, looks like this:
```
NC_000001.11 RefSeq region 1 248956422 . + . ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_000001.11 BestRefSeq pseudogene 11874 14409 . + . ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
NC_000001.11 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 11874 12227 . + . ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 12613 12721 . + . ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
NC_000001.11 BestRefSeq exon 13221 14409 . + . ID=exon-NR_046018.2-3;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
```
My command for indexing:
STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ${2}.fna --sjdbGTFfile ${2}.gff --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 100 --runThreadN 12
And for mapping, I am using paired-end reads:
for map in $( cat patient.txt ); do
echo "Mapping ${map}"
STAR --genomeDir ./ --readFilesIn ${fastqdir}trimmed_${map}_1.fastq ${fastqdir}trimmed_${map}_2.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${map}
done
My main concern is that I have set the wrong parameters for the indexing, since I am not 100% sure about the contents of each column.
Is the Parent in the GFF file the column I should be using for the exon parent transcript parameter (--sjdbGTFtagExonParentTranscript)?
Also, which column here is considered the "gene id"? I am asking because I need to use it for HTSeq and I'm torn between ID / Dbxref / Name / gene.
Sorry for the basic questions, and thanks in advance for your time and help!
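For context, ID, Parent, Dbxref, Name and gene are all key=value pairs inside the ninth (attributes) column of a GFF3 line rather than separate columns. One way to see which of those keys the exon lines actually carry is to tally them. This is only a rough sketch: the file name sample.gff and the single shortened exon line written into it are placeholders based on the excerpt above, and the real annotation file would be used instead.

```shell
# Hypothetical sample file holding one tab-separated exon line
# (shortened from the excerpt); point gff at the real annotation instead.
gff=sample.gff
printf 'NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;gene=DDX11L1;transcript_id=NR_046018.2\n' > "$gff"

# For every exon feature (column 3), split the attributes column (column 9)
# on ";" and count how many exon lines carry each attribute key.
keys=$(awk -F'\t' '$3 == "exon" {
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
        if (attrs[i] == "") continue
        split(attrs[i], kv, "=")
        count[kv[1]]++
    }
}
END { for (k in count) print k, count[k] }' "$gff" | sort)

echo "$keys"
```

A key that shows up on every exon line (its count equals the number of exon lines) is a candidate for a tool option that expects an exon-level attribute name.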