Help for Duplication rate

yingya

Oct 31, 2012, 9:35:38 AM
to qual...@googlegroups.com
Hi,
I am using QualiMap for quality control of aligned sequencing data. The report tells me that the duplication rate is 64.4%. I then used Picard to remove duplicates, and the duplication rate is still 62.41%. Did I do something wrong when running QualiMap, or is there something I have misunderstood?

Konstantin Okonechnikov

Oct 31, 2012, 12:16:31 PM
to qual...@googlegroups.com
Hi,

The duplication rate should be significantly lower after removing duplicates with Picard (in theory 0%). Could you please tell us how you ran the Picard command?

I have created an artificial SAM file with duplicate reads (see the attached file duplicates.sam). This file contains 2 duplicate reads and 4 non-duplicate reads. Please take into account that by "duplication" in Qualimap we mean the following: reads are considered duplicates if they are aligned to the same genomic position. So, in this example, out of 5 alignment start positions, 4 are unique and 1 is non-unique (with 2 reads aligned to it). Qualimap outputs a duplication rate of 20%.
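
The counting rule described above can be sketched in a few lines of Python. This is only an illustration that is consistent with the numbers in this example (1 non-unique start position out of 5 gives 20%). The function name, the (reference, start) input format and the coordinates are assumptions made for the sketch, not Qualimap's actual code.

```python
from collections import Counter

def duplication_rate(read_starts):
    """Percentage of distinct start positions that carry more than one read.

    read_starts: iterable of (reference_name, start_position) tuples, one per read.
    This follows the description in this thread, not Qualimap's source code.
    """
    counts = Counter(read_starts)
    if not counts:
        return 0.0
    non_unique = sum(1 for n in counts.values() if n > 1)
    return 100.0 * non_unique / len(counts)

# Six reads: two share a start position, four do not (hypothetical coordinates).
reads = [("chr1", 100), ("chr1", 100), ("chr1", 250),
         ("chr1", 400), ("chr1", 560), ("chr1", 700)]
print(duplication_rate(reads))  # 20.0
```

Once one of the two reads at position 100 is removed, every start position is unique and the same function returns 0.0.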

After running Qualimap, I applied Picard's MarkDuplicates program:

java -jar $PICARD/MarkDuplicates.jar I=duplicates.sam O=duplicates.clean.sam REMOVE_DUPLICATES=True METRICS_FILE=duplicates_report.txt

For the new file duplicates.clean.sam, Qualimap outputs a duplication rate of 0%.

It would be nice if you could test the same example and report your results.

--
 Konstantin

Konstantin Okonechnikov

Oct 31, 2012, 12:18:14 PM
to qual...@googlegroups.com
Sorry, forgot to attach the file, here it is.

--
 Konstantin
duplicates.sam

yingya

Nov 2, 2012, 1:20:58 AM
to qual...@googlegroups.com, k.okone...@gmail.com
Hi Konstantin,

Thanks for your quick response. My Picard command:
java -Xmx8G -jar picard-tools-1.72/MarkDuplicates.jar I=merge.bam O=dedup.bam M=metric TMP_DIR=. VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000 REMOVE_DUPLICATES=true ASSUME_SORTED=true
I think I know what the problem is. My data are paired-end reads, and Picard will not remove a pair unless both of its reads are aligned to the same genomic positions as another pair. But I have another question: I used "samtools rmdup -S merge.bam dedup.bam" to remove the duplicates, and the duplication rate is still not 0%. Does QualiMap treat a multi-hit read as a duplicate too?

Thanks

On Thursday, November 1, 2012 at 12:16:32 AM UTC+8, Konstantin Okonechnikov wrote:

Fernando Garcia

Nov 7, 2012, 6:17:20 AM
to qual...@googlegroups.com
Hi,

No, Qualimap does not consider multihits as duplicates. As Konstantin outlined in his previous email, we account for duplicated read starts.

In principle, from what we understand of samtools rmdup, after running it with -S (as you did) the duplication rate reported by Qualimap should be 0. It would probably be greater than 0 without the -S parameter, since we currently do not support paired-end information in BAM QC (we will in future releases).

Could you please check whether the duplication rate for your data after running samtools rmdup is actually 0? If it is not, please let us know, since we very likely have a bug in Qualimap.

Regards,
Fernando


--
Dr. Fernando Garcia
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

E-Mail: gar...@mpiib-berlin.mpg.de
Telephone: +493028460426
Web   : www.mpiib-berlin.mpg.de 

lifuqiang

Nov 8, 2012, 5:01:23 AM
to qualimap
Hi Fernando,
 
Here are the duplication rates reported under different conditions:
bam1:    9.76%
bam2:    8.29%
bam3:    0.78%
bam4:    0.72%
 
My data are paired-end reads. I used bwa-0.59 (aln -o 1 -q 10 -i 15 -t 4 -I ; sampe -a 600 ) to align them to hg19.
bam1: the raw result from bwa.
bam2: result of Picard MarkDuplicates
    java -jar MarkDuplicates.jar I=bam1 O=bam2 M=metrics1 ASSUME_SORTED=true REMOVE_DUPLICATES=true
bam3: result of samtools rmdup
    samtools rmdup  -S  bam1  bam3
bam4: uniquely mapped reads only
    bamtools filter -in bam3 -out bam4 -tag "X0=1"
 
Regards,

Fuqiang Li

deep...@csirccmb.org

Feb 28, 2018, 12:10:50 AM
to QualiMap
Jap_bbduk_trimmedreport.pdf shows a 48.12% duplication rate.
After running samtools rmdup, Qualimap v2.2.1 still reports a 26% duplication rate.
On completion, samtools rmdup with -S printed 170424041 / 411005180 = 0.4147 in library '    ' to stderr.
411005180 is the number of mapped reads according to both samtools flagstat and Qualimap, so I assume that rmdup removed 41.47% of the reads as duplicates. Why does Qualimap still show 26% duplicates? Please check the attached reports and reply as soon as possible.
Jap_bbduk_trimmedreport.pdf
S_rmdupreport.pdf

Konstantin Okonechnikov

Feb 28, 2018, 2:27:13 AM
to qual...@googlegroups.com
Hi!

The issue is that Qualimap considers each read alignment individually when estimating duplication, while samtools takes paired reads into account. Because of this, Qualimap can still report duplicates in cases where only one read of a pair is a duplicate while its mate maps to a different location, so samtools does not treat the pair as a duplicate.
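
A toy example may make the difference concrete. The sketch below assumes a simplified model in which a pair-aware tool keys duplicates on the start positions of both mates, while a read-level count keys on each read's own start. The coordinates and helper names are made up for illustration, and real tools also look at strand, orientation and library.

```python
from collections import Counter

# Each pair is (start of read 1, start of read 2); coordinates are made up.
pairs = [(100, 300), (100, 350), (500, 720)]

def pair_level_duplicates(pairs):
    """Count pairs whose (start1, start2) key has already been seen."""
    counts = Counter(pairs)
    return sum(n - 1 for n in counts.values())

def read_level_duplicate_positions(pairs):
    """Count start positions shared by more than one read, ignoring mates."""
    counts = Counter(start for pair in pairs for start in pair)
    return sum(1 for n in counts.values() if n > 1)

print(pair_level_duplicates(pairs))           # 0
print(read_level_duplicate_positions(pairs))  # 1
```

The first two pairs share the start of read 1 but not of read 2, so a pair-level view sees no duplicates while a read-level view still sees position 100 used twice.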

Best regards,
   Konstantin



--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Deepti Rao

Feb 28, 2018, 4:35:04 AM
to qual...@googlegroups.com, Shubhankar Dutta
Dear Konstantin,

As mentioned, I used the option -S in my rmdup command. -S ensures that PE reads are treated as SE.

--
Regards,

Deepti
Graduate Student,
CCMB

Konstantin Okonechnikov

Feb 28, 2018, 4:42:45 AM
to qual...@googlegroups.com
There might be other causes for this, e.g. SNPs in reads that make them non-duplicates for samtools, while Qualimap uses only the alignment start. Could you perhaps share a random subsample of the BAM file after duplicate removal, so I can check it?


Also, which version of samtools did you use?

Best regards,
   Konstantin


Deepti Rao

Feb 28, 2018, 7:08:10 AM
to qual...@googlegroups.com
Please find a subsample bam file attached.
Thanks for looking into the issue.

Konstantin Okonechnikov

Mar 1, 2018, 9:06:09 AM
to qual...@googlegroups.com
Hi! I tried to download the file, but did not get access. Sent a request using 2 different e-mails.

Best regards,
   Konstantin

Deepti Rao

Mar 5, 2018, 4:35:02 AM
to qual...@googlegroups.com
Hi! Sorry for replying late. Please find the attached random bam file. Let me know if you need a bigger file. There was some issue in providing access to google drive. Thanks again!
second_random_rmdup_Jap.bam

Konstantin Okonechnikov

Mar 5, 2018, 10:53:58 AM
to qual...@googlegroups.com
Hi! 

I checked the subsample, but its duplication rate is 0, so a larger file is probably required to reproduce the reported duplicates.

Also, the file is named with a BAM extension, but it is actually in text SAM format. It would be better to convert it to a binary BAM file to reduce the size.
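
A quick way to tell the two formats apart without relying on the extension is to look at the first bytes. The sketch below relies on the fact that BAM is BGZF (gzip-compatible) compressed and that its decompressed content starts with the magic bytes "BAM\1". The function name is hypothetical.

```python
import gzip

def looks_like_bam(path):
    """Return True if the file is gzip/BGZF-compressed and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compressed at all, e.g. a plain-text SAM file.
        return False
```

A text SAM file that has been given a .bam name makes this function return False. The conversion itself is normally done with samtools view -b.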

Best regards,
   Konstantin 

xiaol...@gmail.com

Aug 21, 2020, 9:40:01 PM
to QualiMap
Hi, Dr. Okonechnikov,

You mentioned that "in Qualimap we assume the following: reads are considered duplicates if they are aligned to the same genomic position." Does this mean Qualimap considers both the start and end positions of the aligned reads in the duplicate calculation? I thought Qualimap only considered the start position, i.e. reads with the same start position (regardless of the end position) are considered duplicates. Am I right?


Best,
Xiao

Konstantin Okonechnikov

Aug 24, 2020, 7:24:16 AM
to qual...@googlegroups.com
Hi,

Yep, you're correct: only the alignment start position is taken into account when measuring duplicates. The sentence "...reads are considered duplicates if they are aligned to the same genomic position" is an easy way to explain this, and it is accurate if the read is fully aligned.

Best regards,
   Konstantin

Keri Richards

Aug 24, 2020, 7:53:58 AM
to qual...@googlegroups.com
So in theory this would mean that if you had reads of different lengths but with the same start position, they would be marked as duplicates?

Xiao Lei

Aug 24, 2020, 10:32:14 AM
to qual...@googlegroups.com
Hi, Keri,

Yes. Reads with the same start position (regardless of length) would be marked as duplicates. The duplication criterion in Qualimap is very stringent. I wonder if Qualimap could collapse these duplicated reads into one and output a BAM file with the duplicates collapsed.

Best,

Xiao

Konstantin Okonechnikov

Aug 31, 2020, 11:58:09 AM
to qual...@googlegroups.com
Hi,

By design, Qualimap focuses only on reporting quality control information, not on editing files. Other tools might be useful here, e.g. Picard MarkDuplicates or samtools rmdup.

Best regards,
   Konstantin

Xiao Lei

Aug 31, 2020, 1:07:01 PM
to qual...@googlegroups.com
Hi, Konstantin,

Thanks for your suggestions. Is there any tool which can collapse duplicates under the same criteria (as long as the start site coordinates are the same) as Qualimap? You mentioned Picard MarkDuplicates or samtools rmdup, but these tools consider duplicates using different criteria. Am I right?

Best,
Xiao

Konstantin Okonechnikov

Sep 1, 2020, 8:37:00 AM
to qual...@googlegroups.com
If I remember correctly, for single-end reads samtools markdup should behave similarly to Qualimap. For paired-end reads, both tools typically work on whole fragments, but this may be configurable.
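
For illustration only, the start-position criterion discussed in this thread could be applied by hand to SAM text records roughly as follows. This is a minimal sketch under stated assumptions, not a replacement for the tools above: the function name is hypothetical, it keeps whichever record comes first rather than the best-quality one, it ignores strand, and it does not keep mates together.

```python
def collapse_by_start(sam_lines):
    """Keep the first mapped record seen for each (reference, start) and drop the rest.

    sam_lines: iterable of SAM text lines. Header lines (starting with '@') and
    unmapped records (FLAG bit 0x4) pass through unchanged.
    """
    seen = set()
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: QNAME, FLAG, RNAME, POS (1-based leftmost position), ...
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:
            kept.append(line)
            continue
        key = (rname, pos)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

# Hypothetical records: r1 and r2 start at the same position but differ in length.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t100\t60\t36M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*",
]
print([line.split("\t")[0] for line in collapse_by_start(example)])
```

In this example r2 is dropped even though it is shorter than r1, which is exactly the stringent behaviour described earlier in the thread.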


Xiao Lei

Sep 1, 2020, 11:29:19 AM
to qual...@googlegroups.com
I see. Thanks a lot for your input.

Best,
Xiao
