Hi there,
The module code parses the log output from sambamba to calculate read counts and duplicate rates. For example:
finding positions of the duplicate reads in the file...
sorted 22892630 end pairs
and 205923 single ends (among them 682 unmatched pairs)
collecting indices of duplicate reads... done in 4177 ms
found 12288581 duplicates
There is then a bar plot that shows the distribution of reads. It's not entirely clear how these numbers add up, so I wanted to confirm that I'm getting this right:
- Total read count = 45,991,183
- 2 * 22,892,630 read pairs + 205,923 unpaired reads
- 12,288,581 duplicates, counted as single reads (not read pairs?)
- So 26.7% duplicate reads?
- I guess it's not possible to show single ends / unmatched pairs in this plot, as we don't know which of those end up in the duplicate reads category, so we could end up double-counting them...
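To make the arithmetic concrete, here is a minimal Python sketch of the calculation as I understand it. It is not the actual module code: the function names (`parse_markdup_log`, `summarise`) and the regular expressions are made up for illustration, and it assumes that the "found N duplicates" figure counts individual reads and that the unmatched pairs are not subtracted from the total.

```python
import re

# Log excerpt copied from the example above.
LOG = """\
finding positions of the duplicate reads in the file...
sorted 22892630 end pairs
and 205923 single ends (among them 682 unmatched pairs)
collecting indices of duplicate reads... done in 4177 ms
found 12288581 duplicates
"""


def parse_markdup_log(text):
    """Pull the four counts out of the log text (hypothetical helper)."""
    patterns = {
        "end_pairs": r"sorted (\d+) end pairs",
        "single_ends": r"and (\d+) single ends",
        "unmatched_pairs": r"among them (\d+) unmatched pairs",
        "duplicates": r"found (\d+) duplicates",
    }
    counts = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find '{name}' in log")
        counts[name] = int(match.group(1))
    return counts


def summarise(counts):
    """Return (total reads, duplicate percentage).

    Assumption: each end pair contributes two reads, each single end one
    read, and duplicates are counted per read rather than per pair.
    """
    total = 2 * counts["end_pairs"] + counts["single_ends"]
    duplicate_pct = 100 * counts["duplicates"] / total
    return total, duplicate_pct


counts = parse_markdup_log(LOG)
total_reads, duplicate_pct = summarise(counts)
print(f"total reads: {total_reads:,}")
print(f"duplicate reads: {counts['duplicates']:,} ({duplicate_pct:.1f}%)")
```

With the log above, this gives 45,991,183 total reads and roughly 26.7% duplicates. If duplicates were instead counted per pair, or if the 682 unmatched pairs should come out of the total, the percentage would change, which is exactly what I'd like to confirm.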
Thanks in advance for any help / clarification. Happy to hear suggestions!
Cheers,
Phil