QualiMap v 2.2.1 bamqc reporting incorrect total number of reads

790 views
Skip to first unread message

Sonia Agrawal

unread,
Oct 5, 2016, 6:07:14 PM10/5/16
to QualiMap
Hi

I have whole genome sequencing data from plasmodium genomes. I aligned the genomes using bwa-0.7.13. I ran samtools stats as well as QualiMap v 2.2.1 on the BAM files after sorting and marking duplicates. However, I get different numbers from the 2 software for all the stats such as total number of reads, mapped reads, mapped bases, paired reads, etc. Even the total number of reads reported by Qualimap are much higher than samtools or calculated from original FASTQ files. Please advice why I am seeing these discrepancies.

Thanks
Sonia

Konstantin Okonechnikov

unread,
Oct 6, 2016, 10:57:06 AM10/6/16
to qual...@googlegroups.com
HI!

Note that in the alignment procedure not-aligned reads are usually discarded from the result, therefore the number of reads should be lower then in initial FASTQ. However, option to allow multiple alignment for a read can increase number of resulting alignments.

Basically results from Qualimap and samtools stats should be the same in values: mapped and total, also probably properly paired.  Other values in Qualimap can be different since samtools takes into account non-passing/unmapped reads records. Also, there could be influence of SAM format flag options of the applied alignment tool.

What where the settings for bwa?  Could you perhaps provide detailed numbers reported by samtools stats and by Qualimap BAMQC for the same BAM file?


--
  Konstantin







--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sonia Agrawal

unread,
Oct 6, 2016, 4:40:50 PM10/6/16
to QualiMap, k.okone...@gmail.com
Hi Konstantin

Thanks for the reply. Here is my command for BWA alignment:

/usr/local/packages/bwa-0.7.13/bin/bwa mem -M -R @RG\tID:K00134.L002\tSM:Pv2016_06\tPU:K00134HF33JBBXX.L002\tPI:789\tLB:IL100073949\tPL:ILLUMINA /local/scratch/sagrawal/ReferenceGenomes/Pvivax/PlasmoDB-28_PvivaxSal1_Genome.indexed L002_R1_trimmed.fastq.gz L002_R2_trimmed.fastq.gz

Post alignment, I ran

picard.sam.CleanSam INPUT=L002_full_Pv.sam OUTPUT=L002_full_Pv.cleaned.sam    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

picard.sam.FixMateInformation INPUT=L002_full_Pv.cleaned.sam   ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

picard.sam.SortSam INPUT=L002_full_Pv.cleaned.sam OUTPUT=L002_full_Pv.sorted.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

picard.sam.markduplicates.MarkDuplicates INPUT=[L002_full_Pv.sorted.bam] OUTPUT=L002_full_Pv.marked_dups.bam METRICS_FILE=L002_full_Pv.marked_dups_metrics.txt ASSUME_SORTED=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_DUPLICATES=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

Stats from samtools:

SN      raw total sequences:    38228224
SN      filtered sequences:     0   
SN      sequences:      38228224
SN      is sorted:      1   
SN      1st fragments:  19114112
SN      last fragments: 19114112
SN      reads mapped:   33882473
SN      reads mapped and paired:        32441118        # paired-end technology bit set + both mates mapped
SN      reads unmapped: 4345751
SN      reads properly paired:  31370896        # proper-pair bit set
SN      reads paired:   38228224        # paired-end technology bit set
SN      reads duplicated:       8294174 # PCR or optical duplicate bit set
SN      reads MQ0:      1116143 # mapped and MQ=0
SN      reads QC failed:        0   
SN      non-primary alignments: 821727
SN      total length:   5679176021      # ignores clipping
SN      bases mapped:   5039195716      # ignores clipping
SN      bases mapped (cigar):   4623615499      # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       1236704963
SN      mismatches:     91120057        # from NM fields
SN      error rate:     1.970753e-02    # mismatches / bases mapped (cigar)
SN      average length: 148
SN      maximum length: 151
SN      average quality:        34.9
SN      insert size average:    431.9
SN      insert size standard deviation: 191.9
SN      inward oriented pairs:  14566086
SN      outward oriented pairs: 391934
SN      pairs with other orientation:   9652
SN      pairs on different chromosomes: 512065

Stats from Qualimap v 2.2.1 bamqc

BamQC report
-----------------------------------

>>>>>>> Input

     bam file = L002_full_Pv.marked_dups.bam
     outfile = L002_full_Pv.marked_dups_stats/genome_results.txt


>>>>>>> Reference

     number of bases = 27,013,980 bp
     number of contigs = 2748


>>>>>>> Globals

     number of windows = 3147

     number of reads = 39,049,951
     number of mapped reads = 34,704,200 (88.87%)

     number of mapped paired reads (first in pair) = 17,841,931
     number of mapped paired reads (second in pair) = 16,862,269
     number of mapped paired reads (both in pair) = 33,192,394
     number of mapped paired reads (singletons) = 1,511,806

     number of mapped bases = 4,662,396,332 bp
     number of sequenced bases = 4,654,219,425 bp
     number of aligned bases = 0 bp
     number of duplicated reads (flagged) = 8,294,174


>>>>>>> Insert size

     mean insert size = 1,950.6927
     std insert size = 41,889.1495
     median insert size = 405


>>>>>>> Mapping quality

     mean mapping quality = 32.2599


>>>>>>> ACTG content

     number of A's = 1,333,236,883 bp (28.65%)
     number of C's = 985,505,979 bp (21.17%)
     number of T's = 1,352,221,877 bp (29.05%)
     number of G's = 983,254,686 bp (21.13%)

Both samtools and Qualimap were run on exactly the same BAM file. I also ran number of reads in the original FASTQ file using the command and got same results as samtools reported but not Qualimap:

zcat L002_R1_trimmed.fastq.gz|grep -c "^@"
19114112
zcat L002_R2_trimmed.fastq.gz|grep -c "^@"
19114112

Thanks
Sonia
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+u...@googlegroups.com.

Konstantin Okonechnikov

unread,
Oct 10, 2016, 8:06:56 AM10/10/16
to qual...@googlegroups.com
Hi Sonia,

thanks a lot for detailed report!

Qualimap is reporting higher number of alignments while it's not ignoring secondary. 

From samtools:


SN      raw total sequences:    38228224
SN      non-primary alignments: 821727

Which means 821727 +  38228224 = 39049951, as it is reported by Qualimap.

The same is similar for number of mapped reads (reads mapped plus non-primary alignments) Actually, the secondary alignments were taken into account in RNA-seq QC mode already, they are typical for RNA-seq therefore estimated there. However, now this was updated in BAM QC as well.

Here's the link for novel version:

Would be great if you could check it.

--
  Konstantin





 




To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+unsubscribe@googlegroups.com.

Sonia Agrawal

unread,
Oct 10, 2016, 12:44:22 PM10/10/16
to QualiMap, k.okone...@gmail.com
Hi Konstantin

Thanks for the quick fix. I re-ran the updated code and most of the statistics match to samtools. However, the number of bases mapped, insert sizes are still different as compared to samtools. Also, how is number of sequenced bases fewer then number of mapped bases? I also calculated depth using samtools depth and the number of reads for each position are lower than the numbers reported in qualimap coverage file. Is the coverage file still calculating numbers including non-primary alignments? Find below the output from the new run of Qualimap:

bam file = EPVIV_20160822_K00134_IL100073949_S31_L002_full_Pv.marked_dups.bam
     outfile = temp/genome_results.txt



>>>>>>> Reference

     number of bases = 27,013,980 bp
     number of contigs = 2748


>>>>>>> Globals

     number of windows = 3147

     number of reads = 38,228,224
     number of mapped reads = 33,882,473 (88.63%)
     number of secondary alignments = 821,727

     number of mapped paired reads (first in pair) = 17,501,756
     number of mapped paired reads (second in pair) = 16,380,717
     number of mapped paired reads (both in pair) = 32,441,118
     number of mapped paired reads (singletons) = 1,441,355

     number of mapped bases = 4,626,157,011 bp
     number of sequenced bases = 4,618,058,082 bp

     number of aligned bases = 0 bp
     number of duplicated reads (flagged) = 8,294,174


>>>>>>> Insert size

     mean insert size = 1,156.216
     std insert size = 30,506.08

     median insert size = 405


>>>>>>> Mapping quality

     mean mapping quality = 32.4585


>>>>>>> ACTG content

     number of A's = 1,320,418,355 bp (28.59%)
     number of C's = 983,076,818 bp (21.29%)
     number of T's = 1,333,689,978 bp (28.88%)
     number of G's = 980,872,931 bp (21.24%)
     number of N's = 526,488 bp (0.01%)

     GC percentage = 42.53%

>>>>>>> Mismatches and indels

    general error rate = 0.0197
    number of mismatches = 86,089,128
    number of insertions = 1,331,667
    mapped reads with insertion percentage = 3.62%
    number of deletions = 1,672,138
    mapped reads with deletion percentage = 4.46%
    homopolymer indels = 56.68%


>>>>>>> Coverage

     mean coverageData = 171.2505X
     std coverageData = 119.6328X

Thanks a lot once again for your speedy response and quick fix.

Sonia

Konstantin Okonechnikov

unread,
Oct 11, 2016, 12:00:40 PM10/11/16
to qual...@googlegroups.com
Hi Sonia,

thanks for checking the novel version!

Exactly from this version non-primary alignments are ignored and not applied in any upcoming analysis operations. The variability in insert size could be because of samtools limit in insert size collection (8000 bp by default), while Qualimap does not use any limits. There could be other options that influence the results. Check samtools stats -h for details. 

Importantly, number of mapped bases is collected from alignments (i.e. if there insertion or other change in the alignment, it is considered by Qualimap algorithm covering the reference), while the number of sequenced bases is computed only from reads. 

It's easy to see that number of bases mapped CIGAR  (4623615499)  reported by samtools is quite similar to results in Qualimap (4626157011). Also, number of mapped bases is applied for coverage computation. However the difference should not be high in final mean values between samtools and Qualimap.  

--
  Konstantin





To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+unsubscribe@googlegroups.com.

Esther Kockelmans

unread,
Feb 3, 2022, 7:17:16 AM2/3/22
to QualiMap
Hi Konstantin,

I am also a but confused about the Qualimap output.
My dataset consists of 12,606,144 reads. I have aligned these reads against two different assemblies.
Qualimap tells me that there are 13,488,573 (for assembly nr. 1) and 13,788,026 (for assembly nr. 2) reads in my dataset, which is incorrect.. I think Qualimap calculates the amount of reads by doing a line count on the sorted bam file, since this is similar to 13,488,573 and 13,788,026. But now I am a bit confused about the amount of "mapped reads". The amount of mapped reads is higher than the amount of reads I used for the alignment.. but how is that possible? Is this also fixed with the new Qualimap version?

Hope you can help me with this.

Kind regards,
Esther


Op dinsdag 11 oktober 2016 om 18:00:40 UTC+2 schreef Konstantin Okonechnikov:
Message has been deleted

Konstantin Okonechnikov

unread,
Nov 10, 2022, 9:12:07 AM11/10/22
to qual...@googlegroups.com
Hi, 

most likely the issue that there are secondary alignments of the same read in the BAM file. The tool focuses not on reads, but on the alignments thus this could lead to increase since secondary are not skipped.  Do you have a full report from the BAM file? Or a subsample from BAM file (e.g. only 1 chromosome)? I can take a look to double check.

Best regards,
  Konstantin



On Thu, Nov 3, 2022 at 11:44 AM Shenu Nucleome <sh...@nucleomeinfo.com> wrote:

Hi,

I'm also facing the same issue. Did you find the solution?
Reply all
Reply to author
Forward
0 new messages