Understanding counts in markdup log

Phil Ewels

Jun 23, 2021, 3:09:07 AM
to sambamba-discussion
Hi there,

I'm the developer for MultiQC and I'm currently reviewing a PR to add a module to parse log outputs from sambamba markdup: https://github.com/ewels/MultiQC/pull/1421

The module code parses the log output from sambamba to calculate read counts and duplicate rates. For example:

finding positions of the duplicate reads in the file...
  sorted 22892630 end pairs
     and 205923 single ends (among them 682 unmatched pairs)
  collecting indices of duplicate reads...   done in 4177 ms
  found 12288581 duplicates

There is then a bar plot that shows the distribution of reads. It's not totally clear exactly how these numbers sum, so I wanted to confirm that I'm getting this right:

  • Total read count = 45,991,183
    • 2 * 22,892,630 read pairs + 205,923 unpaired reads
  • Duplicates = 12,288,581, counted as single reads (not read pairs?)
  • So 26.7% duplicate reads?
    • 12,288,581 / 45,991,183
  • I guess it's not possible to show single ends / unmatched pairs separately in this plot: we don't know how many of them end up in the duplicates category, so plotting them alongside the duplicates could double-count them.
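
To make the arithmetic above concrete, here is a minimal Python sketch of the calculation as I currently understand it. The regular expressions and the function name `summarise_markdup` are my own illustration and not the code from the PR. It also bakes in the assumption I'm asking about, namely that "found N duplicates" counts individual reads rather than read pairs.

```python
import re

# Log excerpt copied from the example above.
LOG = """\
finding positions of the duplicate reads in the file...
  sorted 22892630 end pairs
     and 205923 single ends (among them 682 unmatched pairs)
  collecting indices of duplicate reads...   done in 4177 ms
  found 12288581 duplicates
"""


def summarise_markdup(log_text):
    """Extract counts from a sambamba markdup log and derive a duplicate rate.

    Hypothetical helper: assumes the three log lines shown above are present
    and that the duplicate count refers to single reads, not pairs.
    """
    patterns = {
        "end_pairs": r"sorted (\d+) end pairs",
        "single_ends": r"and (\d+) single ends",
        "duplicates": r"found (\d+) duplicates",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        if match is None:
            raise ValueError(f"could not find '{key}' in log")
        counts[key] = int(match.group(1))

    # Each end pair contributes two reads; single ends contribute one each.
    counts["total_reads"] = 2 * counts["end_pairs"] + counts["single_ends"]
    counts["duplicate_rate"] = counts["duplicates"] / counts["total_reads"]
    return counts


summary = summarise_markdup(LOG)
print("total reads:", summary["total_reads"])
print("duplicate rate: {:.1%}".format(summary["duplicate_rate"]))
```

If the duplicate count actually refers to pairs, or mixes pairs and single ends, the `duplicate_rate` line is the part that would need to change.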
Thanks in advance for any help / clarification. Happy to hear suggestions!

Cheers,

Phil

