bamqc duplication rate

926 views
Skip to first unread message

Vlad Saveliev

unread,
Jan 14, 2016, 1:10:00 PM1/14/16
to qual...@googlegroups.com
Hi Konstantin,

I used Qualimap 2.1.3 to analyse a BAM-file from a deep sequencing assay, with pre-marked duplicates (using biobambam2). The raw average target coverage was 882x, and I expected the coverage by unique reads it to be about ~600x. I ran Qualimap with --skip-duplicated flag:

qualimap_v2.1.1/qualimap bamqc --skip-duplicated -nt 1 --java-mem-size=24G -nr 5000 -bam ... -outdir ... -gff ... -c -gd HUMAN

and was very surprized to get 47.83x average coverage inside of regions. In fact, the report looked like the BAM file was hard-clipped by 50x - there were no site in the target with coverage depth above 50x: 
Inline image 1
The duplication rate (flagged) was only 20.76%, and the percentage reads mapped off target was about 37%, so the value 47.83x of mean coverage was completely unexpected. I did some calculations manually on a file with hard-removed duplicates, and the actual average coverage I got was ~650x. I was afraid Qualimap may have done some duplication estimation on its own, but the option --skip-duplicated guaranties that if the duplicates are already marked, it would not happen.

Then I downloaded the latest development version (qualimap-build-03-11-15.tar.gz) in order to try the new --skip-dup-mode option:

qualimap-build-13-01-16/qualimap bamqc --skip-duplicated --skip-dup-mode 1 -nt 1 --java-mem-size=24G -nr 5000 -bam ... -outdir ... -gff ... -c -gd HUMAN

and now the result I got was 674.36x average coverage inside of regions, which is what would I expect.

But when I look at the report, I see the following values:
Mapped reads 89,626,782 / 99.57%
Mapped reads (inside of regions) 57,734,229 / 64.14%
Duplicated reads (flagged) 13,563,584 / 15.07%
Duplicated reads (estimated) 40,789,993 / 45.32% 
Duplication rate 97.02%

The 3rd value is supposed to be the duplication rate according to SAM flags. It is different from what was in the previous version (20.76%), and it now seems to be calculated as "the number of duplicated reads inside the regions" divided by the "total number of reads". It doesn't seem to reflect the real duplication happening in the sample, because all off-target reads here are just considered unique. To make it apples-to-apples, I would divide either on-target by on-target or off-target by off-target.

The second value is an attempt of Qualimap to calculate dup rate in the remaining unique reads (which I was sure I completely turned off by the options...). As far as I understand, Qualimap here estimates dup rate only in those reads that are not already flagged as duplicates, so the total dup rate would be 15.07%+45.32% = 60.39%.  

And the last value, 97.02%, is the one I completely don't understand, and I suspect it has something to do with the 47.83x average coverage in the previous case. If I divide the "total number of duplicated reads on target" by "the total number of reads on target", I get (40789993+13563584)/57734229 = 94.14%, which is still not 97.02. But anyway, I expected Qualimap to take only flagged duplicates into account. The intrinsic algorithms reports the values very far from what other tools like Picard, samblaster, biobambam report.

To summarize, I have 2 questions:
1. What may have caused the hard clipping at 50x and 48.83x reported average coverage? I used Qualimap for hundreds production projects, and now I'm afraid that it may affected them in a less obvious way.
2. What is "Duplication rate 97.02%"?
3. How can I prevent any attempt of Qualimap to detect duplicates by its own means?

-
Vlad

Konstantin Okonechnikov

unread,
Jan 15, 2016, 6:32:12 AM1/15/16
to qual...@googlegroups.com
Hi Vlad,

thanks for the detailed investigation and questions! 

Here are my answers:

1. Most likely, the main reason that leads to high increase in computed duplicates is the ignorance of paired-end reads, which is applied in the Qualimap estimation mode. Basically, Qualimap mode ignores the pair and considers all alignments as separate.There is actually an issue related to this: https://bitbucket.org/kokonech/qualimap/issues/7/bam-qc-add-support-for-paired-end-reads 

2.  Duplication rate is a value which is based on the results of duplication rate histogram (http://rawgit.com/kokonech/kokonech.github.io/master/qualimap/HG00096.chr20_bamqc/qualimapReport.html#genome_uniq_read_starts_histogram.png). This histogram shows a number of locations in the reference where only 1, 2, 3,... alignments start. 

The duplication rate is computed in the following way:

D.R. = ( 1 - ( number_of_locations_where_1_alignment_starts / total_number_of_start_locations ) ) * 100

Basically, if all alignments are unique , the value will be zero. The opposite condition (too many duplicates) leads to 1. This is actually the case in your example - a lot of duplicates were detected by Qualimap. Once again, the analysis is focused only on single-end reads, pairs are ignored.

3. The option -sdmode 1 is exactly what is required: only flagged duplicates will be skipped in this case. Even though other duplicates are computed and reported, they are not skipped. 

Also, as you mention, the percentage of duplicates is computed in comparison to the total number of mapped reads. This will be fixed.

Let me know if there any other questions.

--
  Konstantin 




--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vlad Saveliev

unread,
Jan 15, 2016, 7:58:44 AM1/15/16
to qual...@googlegroups.com
Thank you Konstantin, that's clear. Not taking paired reads into account is clearly an issue. 

I think I misunderstood that statement in the documentation: "If the duplicates are not flagged in BAM file, then they will be detected by Qualimap.". I thought it means that if at least one read is already marked as duplicated in the BAM file, Qualimap will assume that a de-duplication software was already ran with this sample, and therefore Qualimap will not attempt to detect duplicates additionally. But in fact, it will still try to detect duplicates in the remaining reads.

So, in my case, the duplicates were already marked in the SAM flag, but Qualimap still used its own algorithm to detect duplicates, and in result dropped 95% of reads and reported 49x mean coverage, instead of expected ~650x.

Until Qualimap's algorithm starts to consider paired reads, I would suggest turning off duplicates detection by default, unless asked explicitly. So, with the --skip-duplicates flag on, I would check only SAM flags. In other words, I would put --sd-mode into production and make it 1 by default. Otherwise, it's very confusing.

--
Vlad

Konstantin Okonechnikov

unread,
Jan 19, 2016, 8:27:33 AM1/19/16
to qual...@googlegroups.com
Hi Vlad,

I performed a small redesign of skip duplicates mode functionality. 

Importantly, the order of skip-dup-mode option rate was changed. Now it is the following:
 0 : only flagged duplicates (default)
 1 : only estimated by Qualimap
 2 : both flagged and estimated

As you suggested, skipping only flagged duplicates is the default one. The order of options was changed while it makes more sense to have the zero value as the default. Moreover, If skip-duplicates option is activated in the default mode but flagged duplicates are not found, special warning message is provided in Qualimap report. 

Additionally, if the duplicates are marked in the BAM file only these duplicates are reported. The estimated duplicates are reported if there are no marked duplicates in the BAM file and if the second or third mode are selected for skipping the duplicates . 

Finally, the percentage of duplicate reads in the regions is computed based on the number of mapped reads only in the regions.

The novel version with changes is available here:

Would be great if you could check it and provide your comments.

--
  Konstantin






Vlad Saveliev

unread,
Jan 19, 2016, 4:56:23 PM1/19/16
to qual...@googlegroups.com
Thanks a lot Konstantin! That's really great. I ran the new version and the stats look correct. Thanks so much for your effort.

xiaol...@gmail.com

unread,
Aug 22, 2020, 1:27:16 PM8/22/20
to QualiMap
Hi, Dr. Okonechnikov,

Why the cutoff of the x-axis of duplication rates histogram is 50, could it be more than 50?
Thanks,
Xiao

Konstantin Okonechnikov

unread,
Aug 24, 2020, 7:50:44 AM8/24/20
to qual...@googlegroups.com
The default x-axis limit value was adjusted based on some general statistics from typical alignments. Is it too small and does not fit the limits for your case? what type of data/reference if this?

Ideally it  would be great to control this as an additional option, but it's rather minor. Therefore it could be included in some kind of part of configuration file as an additional list of points. I included it as a part of an issue:

Alternatively, simply check the maximum obtained value and if more than default 50, and in this case change to maximum found. 

Best regards,
   Konstnatin

Xiao Lei

unread,
Aug 25, 2020, 8:13:03 AM8/25/20
to qual...@googlegroups.com
Hi, Konstantin,

It is CLIP-seq data, with the HIV virus reference genome. The x-axis in the duplication histogram in Qualimap is too small for my case (please see my attached image).  As you can see, the data show an unusual high duplication rate, and if seems the x-axis could be extended to beyond 50. 
CLIP17_GCNpII_NCp15_cis duplication Qualimap.png

Thanks,
Xiao

Konstantin Okonechnikov

unread,
Aug 31, 2020, 11:54:52 AM8/31/20
to qual...@googlegroups.com
Hi Xiao,

two novel options are integrated into Qualimap: control cut limit for focused coverage histogram and duplication rate histogram. These options are available both in GUI and via command line. Here's the link to the version which includes these options:

Would be great if you could test it in your case and report if there are any issues. 

Best regards,
   Konstantin
 

Xiao Lei

unread,
Aug 31, 2020, 1:31:15 PM8/31/20
to qual...@googlegroups.com
Hi, Konstantin,

I really appreciate your effort to upgrade Qualimap! I just tested the duplication histogram with max 100 and 200 cut of the same data as my previous email. I attach my results here. It seems working but you can see that the last bin in the x-axis
CLIP17BC6_GNCpII_NC_dupcut150.png
CLIP17BC6_GNCpII_NC_dupcut100.png
 is always much higher than the rest, I am not sure if this is expected.

Best,
Xiao

Konstantin Okonechnikov

unread,
Sep 1, 2020, 8:34:07 AM9/1/20
to qual...@googlegroups.com
Hi Xiao,

Actually I am not surprised that for virus sequencing the count of duplicates is so high, since its sequence is really small and read counts are large. At least something similar I observed in other virus studies (e.g. H1N1).  I would simply suggest to increase the duplication rate limit up to 2000 and check the distribution. 

Best regards,
   Konstantin

xiaol...@gmail.com

unread,
Sep 1, 2020, 11:25:43 AM9/1/20
to QualiMap
Hi, Konstantin,

Thanks. I tried to increase the limit to 2000 as you suggested. It still not reach to limit and the histogram is not displayed correctly (see the x-axis). I also attach my result if I increase my limit up to 1000. 

image links:



Best,
Xiao

Reply all
Reply to author
Forward
0 new messages