Problem with QualiMap coverage reports

1,127 views
Skip to first unread message

Eva König

unread,
Mar 28, 2013, 6:10:25 AM3/28/13
to qual...@googlegroups.com
Dear all,

I ran qualimap v0.6 bamqc on bam files of human whole-exome experiments. I provided a bed file with the my target exon regions via the -gff option (I call it exon.bed from now on). However, I believe that the QualiMap coverage report in the "Summary" section, subsection "Per chromosome statistics (inside of region)" is wrong. As an example, for chromosome 1, the following is reported:

Length: 249250621
Mapped bases: 79586776
Mean coverage: 3.66
Standard deviation: 11.94

I performed samtools mpileup only in the exon target regions of my exon.bed file, manually confirmed that the pileup was only generated in these regions and added zeros when samtools did not print a line at all in a target region (assuming that there is 0 coverage there). Then I took the 4th column from the pileup file (which gives the coverage at the given position) and computed some statistics. The mean coverage was 12.13.

These differences can be observed for many chromosomes, also for chrY, where the reported mean coverage is 0.02 and the true mean coverage is 6.29. This was actually what startled me in the first place, since the sample is male.

Do you have any explanation for these observations?

Thank you
Eva

Fernando Garcia

unread,
Mar 28, 2013, 11:51:01 AM3/28/13
to qual...@googlegroups.com
Hi Eva,

Thank you for your feedback.

We have tested the coverage calculation comparing qualimap output to bedtools coverageBed and we do not see the problem you describe. It could be that your samtools mpileup command is not computing the coverage as you expected or that we have an undetected bug in our method. Could you provide the commands you used? 

Attached you can find the files test.bamtest.bam.bai and test.bed used for this test. Here is how we tested our results:


So, at first we calculated the coverage using qualimap providing the bed file as an annotation:

qualimap bamqc -bam test.bam -gff test.bed

Here is an extract from the results:


Regions size/percentage of reference 317,355 / 1.36%

Name
Length
Mapped bases
Mean coverage
Standard deviation
MAL1
643292
5719973
17.88
163.06

After that we used coverageBed to calculate the coverage, with this command:

coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

mappedBases=5724205 RegionsSize=317355 meanCoverage=18.0372



As you can see, values are very close (~0.15% difference, which comes from the fact that we bin the genome in windows, 400 by default). Maybe you could try the same procedure with your data and let us know if the results are different than those from mpileup. 

Same applies to the unexpected low coverage reported on chrY: Could you please apply the coverageBed command with a BED file including only chrY exons and send us the results?

I hope this helps!
Regards,
Fernando

--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Dr. Fernando Garcia
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

E-Mail: gar...@mpiib-berlin.mpg.de
Telephone: +493028460426
Web   : www.mpiib-berlin.mpg.de 
test.bam
test.bam.bai
test.bed

Eva König

unread,
Mar 29, 2013, 8:18:04 AM3/29/13
to qual...@googlegroups.com, gar...@mpiib-berlin.mpg.de
Hi Fernando,

thank you very much for your quick reply, I appreciate your effort.

I do think that mpileup in fact gives me the correct coverage. I randomly selected a few positions from my mpileup file and checked these positions in IGV. The coverage reported in IGV matched the mpileup coverage.
For the pileup, I used:

samtools mpileup -l exons.bed -Q 0 mysample.bam > mysample.pileup

I also used the test.bam file you send to run QualiMap and coverageBed, I got the same results as you. Then I ran my mpileup test on it and got a mean coverage of 15.26

Next, I ran coverageBed on my data once with the exons.bed file only containing chr1 and once only containing chrY, using exactly the command line that you wrote. The output for chr1 was:
mappedBases=79597055 RegionsSize=6235260 meanCoverage=12.7656

and for chrY ist was:
mappedBases=640463 RegionsSize=96642 meanCoverage=6.62717

so in fact, coverageBam computes (almost) the same results like me with my mpilup method and not the results reported by QualiMap.

I also ran coverageBam on the whole genome, there the output was:
mappedBases=756526539 RegionsSize=62085295 meanCoverage=12.1853

which is identical to the QualiMap report in the Summary subsection "Coverage (inside of region)" which gives the coverage for all chromosomes.

So is it possible that QualiMap computes the correct coverage genomewide, but that there is some error in reporting the coverage per chromosome?

Let me know if you would like to do some other tests.

Best
Eva

Fernando Garcia

unread,
Apr 3, 2013, 2:03:39 PM4/3/13
to qual...@googlegroups.com
Hi Eva,

I think we are getting close to a solution. I see three different issues here:

1) As you pointed out we are providing misleading information in the coverage average per chromosome. In the HTML report we provide what we call "Per chromosome statistics (inside of regions)", however we compute the average coverage of the whole chromosome, i.e. we divide by the length of the chromosome instead of by the length of the regions belonging to the chromosome. Same thing happens for the "outside of regions" computation. We are already fixing this and it will be available as soon as possible in a new release.

2) Still it was bothering me that coverageBed / qualimap  outputs are different from your mpileup command. I think I found the reason. Example from the previously sent test data:

coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

mappedBases=5724205 RegionsSize=317355 meanCoverage=18.0372


Please note the number of mapped bases (5724205)

Now from your command and setting the regions length to 317355 we get the result you provided in your last email (15.26):

samtools mpileup -l test.bed -Q 0 test.bam | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'

mappedBases=4845542 meanCoverage=15.2685

So the number of mapped bases is less with mpileup (4845542). I searched online and I found this: http://seqanswers.com/forums/showthread.php?t=17438. Summarizing, mpileup discards some reads that are considered as "anomalous". Qualimap relies solely in the provided BAM file and consequently does not filter any reads. If we set the -A parameter in mpileup:

samtools mpileup -l test.bed -Q 0 test.bam -A | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'

mappedBases=5724205 meanCoverage=18.0372

We get now the exact same numbers. Could you please test this with your data?


3) While testing all this we detected a small problem with the coverage computation of reads containing deletions. We are also correcting it and the fix will be included in the coming release. Please note that this is a very minor issue since the discrepancies on the global scale will be small.


Thanks again for your feedback and sorry about the late response... Easter holidays are to blame!

Fernando

Eva König

unread,
Apr 4, 2013, 8:50:30 AM4/4/13
to qual...@googlegroups.com, gar...@mpiib-berlin.mpg.de
Hi Fernando,

I am glad we could figure out the problem and I am looking forward to the new release. I agree that the difference seems to be the -A option of samtools mpileup. I ran this on my data as well, but in my case, there are no great differences. Apparantly, I don't have many "anomalous" reads in my alignment:

samtools mpileup -l chr1.bed -Q 0 my.bam | awk 'BEGIN{regionLen=6235260}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'
mappedBases=75666421 meanCoverage=12.1352

samtools mpileup -l chr1.bed -Q 0 my.bam -A | awk 'BEGIN{regionLen=6235260}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'
mappedBases=78768702 meanCoverage=12.6328


Maybe this is a good moment to explain, why I did the mpileup in the first place. I do like the qualimap reports alot, but when it comes to coverage, I find it insufficient to only know the mean coverage per chromosome. I am also interested in how evenly the reads are distributed across the target region. I also observe that the mean is strongly influenced by some outliers of extremely high coverage. That's why I find the median more informative. Currently, I use the vector from the mpileup coverage to compute box-whisker plots for every chromosome.

I'd appreciate, if you and the development team can discuss adding some additional plots and statistics to qualimap, e.g. per-chromosome box whisker plots, per-chromosome median. Also, even though I like the first plot "Coverage across the reference", it is difficult to extract information from it, because especially for human, the resolution is very to low. I understand that this is because of the large size of the human genome and making one such plot for each chromosome would probably result in too many plots. Nevertheless, maybe you come up with some ideas of how to better visualize the coverage, that would be really great.

Thanks alot.
Eva
Reply all
Reply to author
Forward
0 new messages