--SJRemoveDuplicatesType - arguably a missing option from STAR ???


Malcolm Cook

Jan 25, 2018, 3:02:49 AM
to rna-star
Hello,

I understand that running STAR a 2nd time as

    STAR --runMode inputAlignmentsFromBAM --bamRemoveDuplicatesType  UniqueIdentical  --outWigType bedGraph --outWigNorm RPM ...

yields RPM normalized wig files that discount PCR/optical duplicates in the RPM calculations going into wig production.

I am looking for a way to similarly produce SJ.out.tab files that do not count duplicate reads but, alas, by my reading of your manual, this is not possible.

I would love to be mistaken.  Am I?

This is important to me as I am studying differential intron retention in fly with 101bp paired-end reads, and am using a strategy that assumes the reads contributing to SJ counts be the same as those contributing to coverage.

Thanks,

~Malcolm Cook

Alexander Dobin

Jan 26, 2018, 12:51:25 PM
to rna-star
Hi Malcolm,

At present, removing duplicates requires multiple steps, because:
1. --bamRemoveDuplicatesType UniqueIdentical or UniqueIdenticalNotMulti does not actually remove duplicates from the BAM file; it only marks them by setting the 0x400 bit in the SAM FLAG.
2. --bamRemoveDuplicatesType and --outWigType cannot be used simultaneously in one run.
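
To make the 0x400 marking concrete, here is a minimal sketch of how the duplicate bit can be tested. The function names is_duplicate_flag and count_duplicates are hypothetical helpers for illustration, not part of STAR or samtools. The only assumption is that FLAG is the second column of each SAM record.

```shell
#!/usr/bin/env bash

# Succeeds (exit status 0) if the given SAM FLAG value has the
# 0x400 (duplicate) bit set. Hypothetical helper, for illustration only.
is_duplicate_flag() {
    local flag=$1
    (( (flag & 0x400) != 0 ))
}

# Counts records marked as duplicates in SAM text read from stdin.
# Header lines (starting with '@') are skipped; FLAG is column 2.
# Integer division is used instead of a bitwise AND so that this works
# in any POSIX awk: 0x400 == 1024.
count_duplicates() {
    awk '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }'
}
```

For example, `samtools view -h Processed.out.bam | count_duplicates` should print the number of marked records; `samtools view -c -f 0x400 Processed.out.bam` gives the same count directly.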

So, to generate the signal files without duplicates:
$ STAR --runMode inputAlignmentsFromBAM --bamRemoveDuplicatesType  UniqueIdenticalNotMulti --inputBAMfile Aligned.sortedByCoordinate.bam
This will generate Processed.out.bam. (Note that UniqueIdentical would also mark all multimappers as duplicates, so in the end you would not see the multimapping signal at all.)
$ samtools view -b -F0x400 Processed.out.bam > Processed.out.noDupl.bam
$ STAR --runMode inputAlignmentsFromBAM --outWigType bedGraph --outWigNorm RPM --inputBAMfile Processed.out.noDupl.bam

If you want to generate the counts for splice junctions after removing duplicates, you can use the script extras/scripts/sjFromSAMcollapseUandM.awk from the STAR distribution:
$ samtools view Processed.out.noDupl.bam | awk -f extras/scripts/sjFromSAMcollapseUandM.awk > SJ.out.noDupl.tab
The output format:
chr start end Nunique Nmultiple
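
As a small illustration of working with this five-column format, the sketch below keeps only junctions supported by a minimum number of unique reads. The function name filter_sj_by_unique and the threshold are hypothetical; the only assumption is that the columns are whitespace-separated in the order listed above.

```shell
#!/usr/bin/env bash

# Reads junction lines (chr start end Nunique Nmultiple) from stdin and
# prints only those whose Nunique (column 4) is at least the given minimum.
# The minimum defaults to 1. Hypothetical helper, for illustration only.
filter_sj_by_unique() {
    local min=${1:-1}
    awk -v min="$min" '$4 + 0 >= min + 0'
}
```

Usage might look like `filter_sj_by_unique 2 < SJ.out.noDupl.tab > SJ.out.noDupl.min2.tab`, where the output file name is just an example.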

Cheers
Alex

Malcolm Cook

Jan 28, 2018, 12:22:35 AM
to rna-star
Thanks very much for the direction.  Most of it makes good sense, and I'm much obliged!

Do I understand correctly that --bamRemoveDuplicatesType requires the --inputBAMfile to be sorted, and that the resulting Processed.out.bam and Processed.out.noDupl.bam will in turn also be sorted?  It makes sense that they would be, but the manual does not say either way.  I suppose that if inputAlignmentsFromBAM were multi-threaded, there would be an opportunity for the output not to preserve the input sort order.  Is this all correct?

Also, a different but related question...

What if I want to generate ReadsPerGene.tab after removing duplicates?  It seems I would need to redo alignReads with the reads in Processed.out.noDupl.bam.  Is this correct?

Thanks for STAR,

Malcolm


Alexander Dobin

Jan 29, 2018, 9:53:08 AM
to rna-star
Hi Malcolm,

--bamRemoveDuplicatesType is not supposed to change the order of the reads, it just marks the duplicates. After you remove the duplicates, the BAM should still be sorted.
As always, it's prudent to check this is what actually happens. If you index the files with samtools index, I think it will complain if it finds them unsorted.
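
Besides indexing, the order can also be checked directly on the SAM text. The following is only a simplified sketch: is_coordinate_sorted is a hypothetical helper, and it assumes that "sorted" means each reference name appears in one contiguous block with non-decreasing positions. It does not compare the reference order against the @SQ lines in the header.

```shell
#!/usr/bin/env bash

# Reads SAM text from stdin. Exits 0 if mapped records look
# coordinate-sorted, 1 otherwise. Hypothetical helper, for illustration.
is_coordinate_sorted() {
    awk '
        /^@/      { next }                    # skip header lines
        $3 == "*" { next }                    # skip records with no reference
        $3 != chrom {
            if ($3 in seen) { bad = 1; exit } # reference reappeared later
            seen[$3] = 1; chrom = $3; pos = 0
        }
        $4 + 0 < pos { bad = 1; exit }        # position went backwards
        { pos = $4 + 0 }
        END { exit bad + 0 }
    '
}
```

For example: `samtools view -h Processed.out.noDupl.bam | is_coordinate_sorted && echo sorted`. The sort order declared in the header can be inspected with `samtools view -H Processed.out.noDupl.bam | grep '^@HD'`.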

At the moment, the duplicate removal uses one thread only, which, as you said, makes it simple to preserve the sorting.
If I add multithreading, I think I will design the algorithm so that it guarantees the output remains sorted.

Cheers
Alex