Hi Konstantin,
I used Qualimap 2.1.3 to analyse a BAM-file from a deep sequencing assay, with pre-marked duplicates (using biobambam2). The raw average target coverage was 882x, and I expected the coverage by unique reads it to be about ~600x. I ran Qualimap with --skip-duplicated flag:
qualimap_v2.1.1/qualimap bamqc --skip-duplicated -nt 1 --java-mem-size=24G -nr 5000 -bam ... -outdir ... -gff ... -c -gd HUMAN
and was very surprized to get 47.83x average coverage inside of regions. In fact, the report looked like the BAM file was hard-clipped by 50x - there were no site in the target with coverage depth above 50x:
The duplication rate (flagged) was only 20.76%, and the percentage reads mapped off target was about 37%, so the value 47.83x of mean coverage was completely unexpected. I did some calculations manually on a file with hard-removed duplicates, and the actual average coverage I got was ~650x. I was afraid Qualimap may have done some duplication estimation on its own, but the option --skip-duplicated guaranties that if the duplicates are already marked, it would not happen.
Then I downloaded the latest development version (qualimap-build-03-11-15.tar.gz) in order to try the new --skip-dup-mode option:
qualimap-build-13-01-16/qualimap bamqc --skip-duplicated --skip-dup-mode 1 -nt 1 --java-mem-size=24G -nr 5000 -bam ... -outdir ... -gff ... -c -gd HUMAN
and now the result I got was 674.36x average coverage inside of regions, which is what would I expect.
But when I look at the report, I see the following values:
Mapped reads 89,626,782 / 99.57%
Mapped reads (inside of regions) 57,734,229 / 64.14%
Duplicated reads (flagged) 13,563,584 / 15.07%
Duplicated reads (estimated) 40,789,993 / 45.32%
The 3rd value is supposed to be the duplication rate according to SAM flags. It is different from what was in the previous version (20.76%), and it now seems to be calculated as "the number of duplicated reads inside the regions" divided by the "total number of reads". It doesn't seem to reflect the real duplication happening in the sample, because all off-target reads here are just considered unique. To make it apples-to-apples, I would divide either on-target by on-target or off-target by off-target.
The second value is an attempt of Qualimap to calculate dup rate in the remaining unique reads (which I was sure I completely turned off by the options...). As far as I understand, Qualimap here estimates dup rate only in those reads that are not already flagged as duplicates, so the total dup rate would be 15.07%+45.32% = 60.39%.
And the last value, 97.02%, is the one I completely don't understand, and I suspect it has something to do with the 47.83x average coverage in the previous case. If I divide the "total number of duplicated reads on target" by "the total number of reads on target", I get (40789993+13563584)/57734229 = 94.14%, which is still not 97.02. But anyway, I expected Qualimap to take only flagged duplicates into account. The intrinsic algorithms reports the values very far from what other tools like Picard, samblaster, biobambam report.
To summarize, I have 2 questions:
1. What may have caused the hard clipping at 50x and 48.83x reported average coverage? I used Qualimap for hundreds production projects, and now I'm afraid that it may affected them in a less obvious way.
2. What is "Duplication rate 97.02%"?
3. How can I prevent any attempt of Qualimap to detect duplicates by its own means?
-
Vlad