Missing row data in ReadsPerGene.tab.out from RefSeq genome and annotation

73 views
Skip to first unread message

Tanner Grudda

unread,
Feb 7, 2024, 8:00:25 AMFeb 7
to rna-star
Hello,
Thank you for your continued interaction and troubleshooting friendliness I've had the chance to learn from. This is my first foray into sequencing analysis and I'm struggling to understand the ReadsPerGene.out.tab information despite searching the manual and online.

I downloaded the reference sequence (genD_NCBI_tg.fna) and annotation (spHBV_tg24_sp123_08.gtf) from NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000861825.2/) and renamed them for my purposes. I am mapping to the HBV genome (3.2kb circular genome with numbering beginning from the EcoRI cute site). During indexing, I've verified that the --sjdbOverhang is correct and adjusted the --genomeSAindesNbases to the genome size. During mapping, I adjusted the --outFilter scores as I will eventually be looking for rare splice variants and limited the --readMapNumber to facilitate fast processing neither of which should be giving me this problem. I have yet to fully understand the --outFilterMultimapNmax, but will return to it later.

The issue I'm facing is that I get counts for only 3/4 of the genes (HBVgp1, HBVgp2, HBVgp3, and HBVgp4). The GTF attribute of what is being shown on the ReadsPerGene.out.tab is actually the gene and not exon (not sure why). I had thought it may be because HBVgp2 has multiple overlapping CDS for the gene leading to them being thrown own at the counting step. However, HBVgp4 also has multiple overlapping CDS for the gene and those are registered by the count. Is there a reason I can't get counts for HBVgp2 considering I get counts for the other genes in the annotation? Is my understanding of the ReadsPerGene.out.tab lacking?

I've included the relevant STAR files below and attached the reference sequence and annnotations.

Thank you in advance for you help.
Tanner

ml STAR/2.7.10a
Star_index

STAR \

        --runThreadN 20 \

        --runMode genomeGenerate \

        --genomeDir /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08 \

        --genomeFastaFiles /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/genD_NCBI_tg.fna \

        --sjdbGTFfile /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/spHBV_tg24_sp123_08.gtf \

        --sjdbGTFtagExonParentTranscript transcript_id \

        --sjdbGTFtagExonParentGene gene_id \

        --sjdbGTFfeatureExon gene \

        --sjdbOverhang 89 \

        --genomeSAindexNbases 4 \

        --sjdbFileChrStartEnd /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/sjdbFile_08.txt

 

 

Star_mapping

STAR --runThreadN 48 \

       --genomeDir /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08 \

       --readFilesIn /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/umi-tool-tg/HB6/HB6_R2_extracted_trimmed.fastq.gz \

       --readFilesCommand zcat \

       --outFilterMultimapNmax 30 \

       --outSAMtype BAM SortedByCoordinate \

       --quantMode TranscriptomeSAM GeneCounts \

       --limitBAMsortRAM 4041611632 \

       --outFilterScoreMinOverLread 0.1 \

       --outFilterMatchNminOverLread 0.1 \

       --outFileNamePrefix /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/mapped/aHb6/nicole.s/tgx/08/08_ \

       --readMapNumber 100000 \

 

echo "FIN"

 

08_ReadsPerGene

N_unmapped      28232   28232   28232

N_multimapping  51940   51940   51940

N_noFeature     0       9463    10365

N_ambiguous     11714   5501    6213

HBVgp3  814     489     325

HBVgp4  1665    529     1136

HBVgp1  5635    3846    1789

HBVgp2  0       0       0

genD_NCBI_tg.fna
spHBV_tg24_sp123_08.gtf

Alexander Dobin

unread,
Feb 23, 2024, 3:18:05 PMFeb 23
to rna-star
Hi Tanner,

I see two issues: 
1) The GTF file has empty transcript_id fields; 
2) gp2 overlaps gp1, and the reads that map to overlapping areas will be considered ambiguous and not counted.

Tanner Grudda

unread,
Mar 20, 2024, 2:17:39 PMMar 20
to rna-star
Ah, thank you, that makes sense. I'll give my data another look with this in mind. Thank you!
Reply all
Reply to author
Forward
0 new messages