--SJRemoveDuplicatesType - arguably a missing option from STAR ???

Malcolm Cook

Jan 25, 2018, 3:02:49 AM

to rna-star

Hello,

I understand that running STAR a 2nd time as

STAR --runMode inputAlignmentsFromBAM --bamRemoveDuplicatesType UniqueIdentical --outWigType bedGraph --outWigNorm RPM ...

yields RPM-normalized wig files that discount PCR/optical duplicates in the RPM calculations going into wig production.

I am looking for a way to similarly produce SJ.out.tab files that do not count duplicate reads but, alas, by my reading of the manual this is not possible.

I would love to be mistaken. Am I?

This is important to me because I am studying differential intron retention in fly with 101bp paired-end reads, using a strategy that assumes the reads contributing to SJ counts are the same as those contributing to coverage.

Thanks,

~Malcolm Cook

Alexander Dobin

Jan 26, 2018, 12:51:25 PM

to rna-star

Hi Malcolm,

At present, the removal of duplicates requires multiple steps.

1. --bamRemoveDuplicatesType UniqueIdentical or UniqueIdenticalNotMulti does not actually remove duplicates from the BAM file, but rather marks them by setting the 0x400 bit in the SAM FLAG.

2. --bamRemoveDuplicatesType and --outWigType cannot be used simultaneously in one run.
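To see what "marked" means in practice, here is a minimal sketch for inspecting the 0x400 (decimal 1024) duplicate bit. The helper names is_duplicate_flag and count_duplicates are hypothetical and are not part of STAR or samtools; they only assume standard SAM text, where the FLAG is the second tab-separated field.

```shell
# Hypothetical helper: succeeds if a SAM FLAG value has the
# duplicate bit (0x400 = 1024) set.
is_duplicate_flag() {
  [ $(( $1 & 0x400 )) -ne 0 ]
}

# Hypothetical helper: count records on stdin (SAM text) whose FLAG
# has the duplicate bit set. Header lines (starting with @) are skipped.
# POSIX awk has no bitwise AND, so the bit is tested arithmetically.
count_duplicates() {
  awk -F'\t' '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }'
}
```

For example, samtools view Processed.out.bam | count_duplicates should report the same number as samtools view -c -f 0x400 Processed.out.bam.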

So, to generate the signal files without duplicates:

$ STAR --runMode inputAlignmentsFromBAM --bamRemoveDuplicatesType UniqueIdenticalNotMulti --inputBAMfile Aligned.sortedByCoord.out.bam

This will generate Processed.out.bam. (Note that UniqueIdentical would also mark all multimappers as duplicates, so in the end you would not see the multimapping signal at all.)

$ samtools view -b -F0x400 Processed.out.bam > Processed.out.noDupl.bam

$ STAR --runMode inputAlignmentsFromBAM --outWigType bedGraph --outWigNorm RPM --inputBAMfile Processed.out.noDupl.bam

If you want to generate the counts for splice junctions after removing duplicates, you can use the script extras/scripts/sjFromSAMcollapseUandM.awk from the STAR distribution:

$ samtools view Processed.out.noDupl.bam | awk -f extras/scripts/sjFromSAMcollapseUandM.awk > SJ.out.noDupl.tab

The output format:

chr start end Nunique Nmultiple
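To illustrate how such a table can be derived from alignments, here is a simplified sketch. It is not the sjFromSAMcollapseUandM.awk script from the STAR distribution, and the function name sj_from_sam is hypothetical. It assumes each N operation in the CIGAR is a junction and that the NH:i: tag distinguishes unique (NH = 1) from multimapping reads. Unlike STAR, it counts every alignment record separately, so two mates crossing the same junction are counted twice.

```shell
# Hypothetical helper: tally splice junctions from SAM text on stdin.
# Output columns (tab-separated): chr start end Nunique Nmultiple
sj_from_sam() {
  awk -F'\t' '
    !/^@/ {
      # Number of reported alignments; treat a missing NH tag as unique.
      nh = 1
      for (i = 12; i <= NF; i++) if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0

      pos = $4 + 0      # 1-based leftmost reference position
      cigar = $6
      # Walk the CIGAR one operation at a time.
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        len = substr(cigar, 1, RLENGTH - 1) + 0
        op  = substr(cigar, RLENGTH, 1)
        if (op == "N") {
          # Intron spans pos .. pos+len-1 on the reference.
          key = $3 "\t" pos "\t" (pos + len - 1)
          if (nh == 1) uniq[key]++; else multi[key]++
          seen[key] = 1
        }
        # Only these operations consume reference bases.
        if (op ~ /[MDN=X]/) pos += len
        cigar = substr(cigar, RLENGTH + 1)
      }
    }
    END {
      for (k in seen) print k "\t" (uniq[k] + 0) "\t" (multi[k] + 0)
    }
  ' | sort -k1,1 -k2,2n
}
```

Usage would look like samtools view Processed.out.noDupl.bam | sj_from_sam, which makes it easy to compare a few junctions by hand against the output of the distributed script.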

Cheers

Alex

Malcolm Cook

Jan 28, 2018, 12:22:35 AM

to rna-star

Thanks very much for the direction. Most of it makes good sense, and I'm much obliged!

Do I understand correctly that --bamRemoveDuplicatesType requires the --inputBAMfile to be sorted, and that the resulting Processed.out.bam and Processed.out.noDupl.bam will in turn also be sorted? It makes sense that they would be, but the manual does not say either way. I suppose that if inputAlignmentsFromBAM were multi-threaded, the output might not preserve the input sort order. Is this all correct?

Also, a different but related question...

What if I want to generate ReadsPerGene.tab after removing duplicates? It seems I would need to redo alignReads with the reads in Processed.out.noDupl.bam. Is this correct?

Thanks for STAR,

Malcolm

Alexander Dobin

Jan 29, 2018, 9:53:08 AM

to rna-star

Hi Malcolm,

--bamRemoveDuplicatesType is not supposed to change the order of the reads; it just marks the duplicates. After you remove the duplicates, the BAM should still be sorted.

As always, it's prudent to check this is what actually happens. If you index the files with samtools index, I think it will complain if it finds them unsorted.
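As an additional, independent check, a sketch like the following scans SAM text and reports the first record that breaks coordinate order. The function name check_coord_sorted is hypothetical. It assumes that "sorted" means positions never decrease within a reference and that no reference name reappears after another one has started; it does not verify that references follow the order of the @SQ header lines.

```shell
# Hypothetical helper: exit status 0 if SAM records on stdin look
# coordinate-sorted, non-zero (with a message) otherwise.
check_coord_sorted() {
  awk -F'\t' '
    /^@/ { next }                       # skip header lines
    $3 != chr {
      # A reference that was already finished must not come back.
      if ($3 in seen) {
        print "unsorted: " $3 " reappears at line " NR
        bad = 1
        exit
      }
      seen[$3] = 1
      chr = $3
      last = 0
    }
    ($4 + 0) < last {
      print "unsorted: " $3 ":" $4 " follows position " last " at line " NR
      bad = 1
      exit
    }
    { last = $4 + 0 }
    END { exit bad + 0 }
  '
}
```

For example: samtools view Processed.out.noDupl.bam | check_coord_sorted && echo "looks sorted".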

At the moment, the duplicate removal uses one thread only, which, as you said, makes it simple to preserve the sorting.

If I add multithreading, I will make sure the algorithm guarantees that the output remains sorted.