Hi Alex,
Thanks for your reply. Sorry about the delay; I did not realize you had responded. I believe Picard MarkDuplicates only uses primary alignments (i.e. it ignores supplementary/secondary alignments). So you are right, it's not considering all the possible alignments when it marks duplicates.
And thanks for the suggestion. That approach makes sense, but we want to avoid removing reads from the data if we can help it.
I dug into this further and found that about 50% of the difference between the genome and transcriptome BAMs is explained by the fact that reads marked as duplicates in the genome BAM are dropped from the transcriptome BAM. For example, a read with the query name A is present in the genome BAM as a duplicate, but I cannot find it in the transcriptome BAM, not even as an unmapped read. Is this the expected behavior? I'm using STAR 2.6.1 with the following command line arguments:
STAR --readFilesIn ~{bam} --readFilesType SAM PE --readFilesCommand samtools view -h \
--runMode alignReads --genomeDir star_index --outSAMtype BAM Unsorted --runThreadN 8 \
--limitSjdbInsertNsj 1200000 --outSAMstrandField intronMotif --outSAMunmapped Within \
--outFilterType BySJout --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 \
--outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 \
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 --alignSoftClipAtReferenceEnds Yes --chimSegmentMin 15 --chimMainSegmentMultNmax 1 \
--chimOutType WithinBAM SoftClip --chimOutJunctionFormat 0 --twopassMode Basic --quantMode TranscriptomeSAM --quantTranscriptomeBan Singleend
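For anyone who wants to reproduce the check, here is a minimal Python sketch of the kind of comparison described above. It is not necessarily how the numbers in this post were obtained. It assumes both BAMs have first been converted to SAM text (for example with samtools view -h) and that duplicates are marked with the standard 0x400 flag bit; the function names are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def read_names(sam_lines, duplicates_only=False):
    """Collect query names from SAM text lines, skipping header lines.

    If duplicates_only is True, keep only records whose FLAG has the
    duplicate bit set.
    """
    names = set()
    for line in sam_lines:
        # Header lines start with "@"; a valid QNAME never does.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if duplicates_only and not flag & DUPLICATE_FLAG:
            continue
        names.add(qname)
    return names


def missing_duplicates(genome_sam, transcriptome_sam):
    """Return query names flagged as duplicates in the genome SAM
    that do not appear at all in the transcriptome SAM."""
    dup_names = read_names(genome_sam, duplicates_only=True)
    transcriptome_names = read_names(transcriptome_sam)
    return dup_names - transcriptome_names
```

Both arguments can be open file handles or any iterable of SAM lines, so the output of samtools view can be passed in directly.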
I'd be more than happy to provide more information. Thank you for your help.
Best,
Takuto