Different genome coverage and Stdev results between qualimap and samtools depth

759 views
Skip to first unread message

Sagi Polani

unread,
Oct 24, 2013, 9:55:31 AM10/24/13
to qual...@googlegroups.com
Hi All,

I've been attempting to run qualimap and samtools depth on my WGS bam file in order to calculate both mean coverage and Stdev of my whole genome data (Illumina paired-end reads).

For qualimap, I've been running:

$ qualimap bamqc -bam [input.bam] -c

for Samtools, I've been running the following (based on this post: http://www.biostars.org/p/5165/)

$ samtools depth [input.bam]  |  awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'


Here are the results:

Coverage             Coverage                Stdev               Stdev
(qualimap)           (samtools)            (qualimap)        (samtools)

   11.03                   11.44                  115.05              24.59


It seems that there is a major difference between the two tools in regards to the Stdev.

I'd appreciate your thoughts about this.

Thanks!

Sagi


Konstantin Okonechnikov

unread,
Oct 29, 2013, 10:26:42 AM10/29/13
to qual...@googlegroups.com
Hi Sagi,

thanks for your report.

One possible source of the difference is the normalization.

In the command, that you use to calculate coverage,  NR (number of records) is applied to normalize the coverage. However samtools depth doesn't output any lines for regions with no coverage. That means NR does not equal to the actual length of the reference. Qualimap uses the size of the reference as given in the BAM file (or as calculated from given regions).

The length of the reference can be calculated from the header of the BAM file:

samtools view -H input.bam | grep @SQ | cut -d ':' -f 3 | awk '{size+=$1}END{print "Ref_size = ",size}'

This length can be then inserted in your command:

samtools depth input.bam  |  awk 'BEGIN{rs=REF_SIZE;}{sum+=$3;sumsq+=$3*$3} END{ print "Ref_size =",rs; print "Average = ",sum/rs; print "Stdev = ", sqrt(sumsq/rs - (sum/rs)*(sum/rs))}'
 
Could you please check using the commands above if there is still large discrepancy in the results between samtools and Qualimap?

--
 Konstantin




--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Sagi Polani

unread,
Nov 1, 2013, 7:14:35 AM11/1/13
to qual...@googlegroups.com, k.okone...@gmail.com
Ho Konstantin,

Thanks for your reply.

I attempted your commands - Still there's a major difference... :(

Here are all the outputs:

Coverage             Coverage                        Coverage                           Stdev               Stdev                      Stdev
(qualimap)           (samtools)             (Konstantin Command)             (qualimap)        (samtools)     (Konstantin Command)

   11.03                   11.44                             10.89                              115.05              24.59                      24.11

I'd appreciate your thoughts...

Thanks!

Sagi

Konstantin Okonechnikov

unread,
Nov 11, 2013, 8:36:09 AM11/11/13
to qual...@googlegroups.com
Hi Sagi,

I think I found what might be the source of the discrepancy in the standard deviation.

Samtools discards by default all reads that are marked as PCR duplicates (Samtools FLAG  0x400, check the format specification for details), while Qualimap takes them into account when calculating coverage.

Try removing all duplicates from the BAM file and running Qualimap on the filtered file. The filtering can be performed using samtools:

samtools view -h -F 0x400 input.bam | samtools view -Sb - > input.filtered.bam

If the discrepancy still remains, maybe you could share a small subset of the your file, so I can investigate it in detail. You can send the subset to my institutional email.

Btw, I also compared the coverage output with bedGenomeCoverage tool from bedtools package in my experiments, this could be an option for you to try.

The command is similar to the one you used for samtools depth:

genomeCoverageBed -ibam input.bam -g genome.file -d | awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)*(sum/NR))}'

Hope that helps,
   Konstantin







Reply all
Reply to author
Forward
0 new messages