Hi Konstantin
Thanks for the reply. Here is my command for BWA alignment:
/usr/local/packages/bwa-0.7.13/bin/bwa mem -M -R @RG\tID:K00134.L002\tSM:Pv2016_06\tPU:K00134HF33JBBXX.L002\tPI:789\tLB:IL100073949\tPL:ILLUMINA /local/scratch/sagrawal/ReferenceGenomes/Pvivax/PlasmoDB-28_PvivaxSal1_Genome.indexed L002_R1_trimmed.fastq.gz L002_R2_trimmed.fastq.gz
Post alignment, I ran
picard.sam.CleanSam INPUT=L002_full_Pv.sam OUTPUT=L002_full_Pv.cleaned.sam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
picard.sam.FixMateInformation INPUT=L002_full_Pv.cleaned.sam ASSUME_SORTED=false ADD_MATE_CIGAR=true IGNORE_MISSING_MATES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
picard.sam.SortSam INPUT=L002_full_Pv.cleaned.sam OUTPUT=L002_full_Pv.sorted.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
picard.sam.markduplicates.MarkDuplicates INPUT=[L002_full_Pv.sorted.bam] OUTPUT=L002_full_Pv.marked_dups.bam METRICS_FILE=L002_full_Pv.marked_dups_metrics.txt ASSUME_SORTED=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_DUPLICATES=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
Stats from samtools:
SN raw total sequences: 38228224
SN filtered sequences: 0
SN sequences: 38228224
SN is sorted: 1
SN 1st fragments: 19114112
SN last fragments: 19114112
SN reads mapped: 33882473
SN reads mapped and paired: 32441118 # paired-end technology bit set + both mates mapped
SN reads unmapped: 4345751
SN reads properly paired: 31370896 # proper-pair bit set
SN reads paired: 38228224 # paired-end technology bit set
SN reads duplicated: 8294174 # PCR or optical duplicate bit set
SN reads MQ0: 1116143 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 821727
SN total length:
5679176021 # ignores clipping
SN bases mapped:
5039195716 # ignores clipping
SN bases mapped (cigar): 4623615499 # more accurate
SN bases trimmed: 0
SN bases duplicated: 1236704963
SN mismatches: 91120057 # from NM fields
SN error rate: 1.970753e-02 # mismatches / bases mapped (cigar)
SN average length: 148
SN maximum length: 151
SN average quality: 34.9
SN insert size average: 431.9
SN insert size standard deviation: 191.9
SN inward oriented pairs: 14566086
SN outward oriented pairs: 391934
SN pairs with other orientation: 9652
SN pairs on different chromosomes: 512065
Stats from Qualimap v 2.2.1 bamqc
BamQC report
-----------------------------------
>>>>>>> Input
bam file = L002_full_Pv.marked_dups.bam
outfile = L002_full_Pv.marked_dups_stats/genome_results.txt
>>>>>>> Reference
number of bases = 27,013,980 bp
number of contigs = 2748
>>>>>>> Globals
number of windows = 3147
number of reads = 39,049,951
number of mapped reads = 34,704,200 (88.87%)
number of mapped paired reads (first in pair) = 17,841,931
number of mapped paired reads (second in pair) = 16,862,269
number of mapped paired reads (both in pair) = 33,192,394
number of mapped paired reads (singletons) = 1,511,806
number of mapped bases = 4,662,396,332 bp
number of sequenced bases = 4,654,219,425 bp
number of aligned bases = 0 bp
number of duplicated reads (flagged) = 8,294,174
>>>>>>> Insert size
mean insert size = 1,950.6927
std insert size = 41,889.1495
median insert size = 405
>>>>>>> Mapping quality
mean mapping quality = 32.2599
>>>>>>> ACTG content
number of A's = 1,333,236,883 bp (28.65%)
number of C's = 985,505,979 bp (21.17%)
number of T's = 1,352,221,877 bp (29.05%)
number of G's = 983,254,686 bp (21.13%)
Both samtools and Qualimap were run on exactly the same BAM file. I also ran number of reads in the original FASTQ file using the command and got same results as samtools reported but not Qualimap:
zcat L002_R1_trimmed.fastq.gz|grep -c "^@"
19114112
zcat L002_R2_trimmed.fastq.gz|grep -c "^@"
19114112
Thanks
Sonia