Duplication rates in Transcriptome-aligned vs Genome-aligned bams

Takuto Sato

Sep 16, 2021, 5:07:28 PM

to rna-star

Hello,

I am comparing the duplication rates of transcriptome-aligned vs genome-aligned bams using Picard MarkDuplicates. I have found that the genome-aligned bam reports a much higher duplication rate (30%) than the transcriptome-aligned bam (~10%).

Some of the difference is explained, I think, by the fact that transcriptome bams do not contain unpaired reads. In general, the transcriptome bams seem to drop some of the original reads that are present in the genome bams, so the denominator in (duplication rate) = (duplicate reads)/(total reads) is very different between the genome- and transcriptome-aligned bams.
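To make the denominator effect concrete, here is a minimal sketch that computes (duplicate reads)/(total reads) directly from SAM FLAG bits. The function name `dup_fraction` is hypothetical, and the count is a simplification: it tallies primary records (neither 0x100 nor 0x800 set) and treats 0x400 as the duplicate mark. Picard computes its own PERCENT_DUPLICATION from the reads it examines, so the figures need not match exactly.

```shell
# dup_fraction (hypothetical helper): reads SAM text on stdin and prints
# "<duplicates> <total> <fraction>" over primary alignment records only.
dup_fraction() {
  awk -F'\t' '
    /^@/ { next }                            # skip SAM header lines
    {
      flag = $2
      if (int(flag / 256) % 2 == 1) next     # 0x100: secondary alignment
      if (int(flag / 2048) % 2 == 1) next    # 0x800: supplementary alignment
      total++
      if (int(flag / 1024) % 2 == 1) dup++   # 0x400: marked as duplicate
    }
    END {
      if (total == 0) { print "0 0 NA"; exit }
      printf "%d %d %.3f\n", dup, total, dup / total
    }'
}
```

One way to use it, assuming samtools is installed and the BAM has already been through MarkDuplicates, is `samtools view -h marked.bam | dup_fraction` (the file name is a placeholder). Running it on both bams shows whether the numerator, the denominator, or both differ.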

But technicalities aside, one would expect the duplication rate to be the same whether the reads are aligned to the transcriptome or the genome. I was wondering if anyone has any insight into this.

Best,

Takuto

Alexander Dobin

Sep 24, 2021, 6:13:39 PM

to rna-star

Hi Takuto,

I am not sure how Picard MarkDuplicates is dealing with multimapping reads - it's not straightforward.

I suspect it may not be considering them at all.

If that's the case, since most reads in the transcriptome space are multimappers (even those that were unique mappers to the genome), they will not be marked as duplicated (even if they were marked in the genome space).

I guess the workaround would be to mark them in the genome space, remove them from BAM, and then remap the remaining reads with transcriptomic conversion - you can use BAM as input for STAR.

Cheers

Alex

Takuto Sato

Oct 24, 2021, 8:30:20 PM

to rna-star

Hi Alex,

Thanks for your reply. Sorry about the delay; I did not realize you had responded. I believe Picard MarkDuplicates only uses primary alignments (i.e. ignores supplementary/secondary alignments). So you are right, it's not considering all the possible alignments when it marks duplicates.
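A quick way to see how many records fall outside the primary set is to tally alignments by FLAG class. This is only a sketch: `alignment_classes` is a hypothetical helper, and it relies solely on the standard SAM FLAG bits (0x100 secondary, 0x800 supplementary), not on anything specific to Picard or STAR.

```shell
# alignment_classes (hypothetical helper): reads SAM text on stdin and
# prints how many records are primary, secondary, or supplementary.
alignment_classes() {
  awk -F'\t' '
    /^@/ { next }                                  # skip SAM header lines
    {
      flag = $2
      if (int(flag / 2048) % 2 == 1) supp++        # 0x800: supplementary
      else if (int(flag / 256) % 2 == 1) sec++     # 0x100: secondary
      else prim++                                  # everything else: primary
    }
    END {
      printf "primary=%d secondary=%d supplementary=%d\n", prim, sec, supp
    }'
}
```

Fed from `samtools view` on each bam (assuming samtools is available), the output gives a rough idea of how much of each file a primary-only duplicate marker would leave out.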

And thanks for the suggestion. That approach makes sense, but we want to avoid removing reads from the data if we can help it.

I dug into this deeper and found that about 50% of the difference between the genome and transcriptome bams is explained by the fact that duplicate reads in the genome bam are dropped from the transcriptome bam. For example, a read with the query name A is present in the genome bam as a duplicate, but I cannot find it in the transcriptome bam, not even as an unmapped read. Is this the expected behavior? I'm using STAR 2.6.1 with the following command line arguments:

STAR --readFilesIn ~{bam} --readFilesType SAM PE --readFilesCommand samtools view -h \
--runMode alignReads --genomeDir star_index --outSAMtype BAM Unsorted --runThreadN 8 \
--limitSjdbInsertNsj 1200000 --outSAMstrandField intronMotif --outSAMunmapped Within \
--outFilterType BySJout --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 \
--outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 \
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 --alignSoftClipAtReferenceEnds Yes --chimSegmentMin 15 --chimMainSegmentMultNmax 1 \
--chimOutType WithinBAM SoftClip --chimOutJunctionFormat 0 --twopassMode Basic --quantMode TranscriptomeSAM --quantTranscriptomeBan Singleend

I'd be more than happy to provide more information. Thank you for your help.

Best,

Takuto

Alexander Dobin

Oct 25, 2021, 2:09:51 PM

to rna-star

Hi Takuto,

Most STAR discussions have moved to GitHub, which also gives better notifications:

https://github.com/alexdobin/STAR/issues

If an alignment is present in the genomic BAM but not in the transcriptomic one, it means that the read mapped to the genome but is not concordant with any annotated transcript.

Such reads are not output as "unmapped reads", because they are not truly "unmapped". This was important for the RSEM error model.
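For anyone wanting to list the reads this affects, the sketch below prints query names that occur in one name list but not in another. `missing_reads` and the file names in the usage note are hypothetical; the inputs are plain text files with one read name per line.

```shell
# missing_reads (hypothetical helper):
#   $1 = file of read names taken from the genome BAM
#   $2 = file of read names taken from the transcriptome BAM
# Prints each name found in $1 but absent from $2, once, in order of appearance.
missing_reads() {
  awk '
    FILENAME == ARGV[1] { seen[$1] = 1; next }          # first file given to awk: transcriptome names
    !($1 in seen) && !printed[$1]++ { print $1 }        # genome-only names, de-duplicated
  ' "$2" "$1"
}
```

The name lists could be produced with something like `samtools view -f 1024 genome.bam | cut -f1 > genome_dups.txt` and `samtools view transcriptome.bam | cut -f1 > tx_names.txt`, followed by `missing_reads genome_dups.txt tx_names.txt`. Here `-f 1024` restricts the genome list to records already marked as duplicates.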

Cheers

Alex

Takuto Sato

Oct 25, 2021, 5:42:56 PM

to rna-star

Hi Alex,

That makes sense. Thanks for the explanation. I will make sure to post on GitHub when I have more questions about STAR.