Understanding counts in markdup log

Phil Ewels

Jun 23, 2021, 3:09:07 AM

to sambamba-discussion

Hi there,

I'm the developer for MultiQC and I'm currently reviewing a PR to add a module to parse log outputs from sambamba markdup: https://github.com/ewels/MultiQC/pull/1421

The module code parses the log output from sambamba to calculate read counts and duplicate rates. For example:

finding positions of the duplicate reads in the file...

sorted 22892630 end pairs

and 205923 single ends (among them 682 unmatched pairs)

collecting indices of duplicate reads... done in 4177 ms

found 12288581 duplicates

There is then a bar plot that shows the distribution of reads. It's not totally clear exactly how these numbers sum, so I wanted to confirm that I'm getting this right:

Total read count = 45,991,183
- 2 * 22,892,630 read pairs + 205,923 unpaired reads
Duplicates = 12,288,581 single reads (not read pairs?)
So 26.7% duplicate reads?
- 12,288,581 / 45,991,183
I guess it's not possible to show single ends / unmatched pairs in this plot, as we don't know which of those end up in the duplicate reads category, so we could be double-counting them...
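
To make the arithmetic concrete, here is a minimal Python sketch of the calculation as I understand it. It is not the code from the PR: the names `parse_markdup_log` and `duplicate_rate` are made up for illustration, the regular expressions only cover the log lines quoted above, and the formula rests on my own assumption (the thing I'm asking about) that each end pair counts as two reads, each single end as one read, and "duplicates" counts individual reads.

```python
import re

# Log excerpt as quoted in this post.
LOG = """\
finding positions of the duplicate reads in the file...
sorted 22892630 end pairs
and 205923 single ends (among them 682 unmatched pairs)
collecting indices of duplicate reads... done in 4177 ms
found 12288581 duplicates
"""

# Hypothetical patterns, written against the lines above only.
PATTERNS = {
    "end_pairs": r"sorted (\d+) end pairs",
    "single_ends": r"and (\d+) single ends",
    "unmatched_pairs": r"among them (\d+) unmatched pairs",
    "duplicates": r"found (\d+) duplicates",
}


def parse_markdup_log(text):
    """Extract the four counts from a markdup log excerpt."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find '{key}' in log")
        counts[key] = int(match.group(1))
    return counts


def duplicate_rate(counts):
    """Return (total_reads, duplicate_fraction).

    ASSUMPTION: total reads = 2 * end pairs + single ends, and the
    duplicate count is in reads, not pairs. Unmatched pairs are parsed
    but not used, since they are reported as a subset of single ends.
    """
    total = 2 * counts["end_pairs"] + counts["single_ends"]
    return total, counts["duplicates"] / total


counts = parse_markdup_log(LOG)
total, rate = duplicate_rate(counts)
print(f"total reads: {total:,}")
print(f"duplicate rate: {rate:.1%}")
```

If the duplicates turn out to be counted in pairs rather than reads, only `duplicate_rate` would need to change.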

Thanks in advance for any help / clarification. Happy to hear suggestions!

Cheers,

Phil
