Missing row data in ReadsPerGene.tab.out from RefSeq genome and annotation

Tanner Grudda

unread,

Feb 7, 2024, 8:00:25 AMFeb 7

to rna-star

Hello,

Thank you for your continued interaction and troubleshooting friendliness I've had the chance to learn from. This is my first foray into sequencing analysis and I'm struggling to understand the ReadsPerGene.out.tab information despite searching the manual and online.

I downloaded the reference sequence (genD_NCBI_tg.fna) and annotation (spHBV_tg24_sp123_08.gtf) from NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000861825.2/) and renamed them for my purposes. I am mapping to the HBV genome (3.2kb circular genome with numbering beginning from the EcoRI cute site). During indexing, I've verified that the --sjdbOverhang is correct and adjusted the --genomeSAindesNbases to the genome size. During mapping, I adjusted the --outFilter scores as I will eventually be looking for rare splice variants and limited the --readMapNumber to facilitate fast processing neither of which should be giving me this problem. I have yet to fully understand the --outFilterMultimapNmax, but will return to it later.

The issue I'm facing is that I get counts for only 3/4 of the genes (HBVgp1, HBVgp2, HBVgp3, and HBVgp4). The GTF attribute of what is being shown on the ReadsPerGene.out.tab is actually the gene and not exon (not sure why). I had thought it may be because HBVgp2 has multiple overlapping CDS for the gene leading to them being thrown own at the counting step. However, HBVgp4 also has multiple overlapping CDS for the gene and those are registered by the count. Is there a reason I can't get counts for HBVgp2 considering I get counts for the other genes in the annotation? Is my understanding of the ReadsPerGene.out.tab lacking?

I've included the relevant STAR files below and attached the reference sequence and annnotations.

Thank you in advance for you help.

Tanner

ml STAR/2.7.10a
Star_index

STAR \

--runThreadN 20 \

--runMode genomeGenerate \

--genomeDir /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08 \

--genomeFastaFiles /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/genD_NCBI_tg.fna \

--sjdbGTFfile /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/spHBV_tg24_sp123_08.gtf \

--sjdbGTFtagExonParentTranscript transcript_id \

--sjdbGTFtagExonParentGene gene_id \

--sjdbGTFfeatureExon gene \

--sjdbOverhang 89 \

--genomeSAindexNbases 4 \

--sjdbFileChrStartEnd /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08/sjdbFile_08.txt

Star_mapping

STAR --runThreadN 48 \

--genomeDir /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/sp.index/nicole.s/tgx/08 \

--readFilesIn /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/umi-tool-tg/HB6/HB6_R2_extracted_trimmed.fastq.gz \

--readFilesCommand zcat \

--outFilterMultimapNmax 30 \

--outSAMtype BAM SortedByCoordinate \

--quantMode TranscriptomeSAM GeneCounts \

--limitBAMsortRAM 4041611632 \

--outFilterScoreMinOverLread 0.1 \

--outFilterMatchNminOverLread 0.1 \

--outFileNamePrefix /home/tgrudda1/scr4-cthio1/10X-Visium/clo20_183119/outs/fastq_path/HG23MDRX2/sp.star/mapped/aHb6/nicole.s/tgx/08/08_ \

--readMapNumber 100000 \

echo "FIN"

08_ReadsPerGene

N_unmapped 28232 28232 28232

N_multimapping 51940 51940 51940

N_noFeature 0 9463 10365

N_ambiguous 11714 5501 6213

HBVgp3 814 489 325

HBVgp4 1665 529 1136

HBVgp1 5635 3846 1789

HBVgp2 0 0 0

genD_NCBI_tg.fna

spHBV_tg24_sp123_08.gtf

Alexander Dobin

unread,

Feb 23, 2024, 3:18:05 PMFeb 23

to rna-star

Hi Tanner,

I see two issues:

1) The GTF file has empty transcript_id fields;

2) gp2 overlaps gp1, and the reads that map to overlapping areas will be considered ambiguous and not counted.

Tanner Grudda

unread,

Mar 20, 2024, 2:17:39 PMMar 20

to rna-star

Ah, thank you, that makes sense. I'll give my data another look with this in mind. Thank you!

Reply all

Reply to author

Forward