Hi Shruti,
You have a couple of options. You can use rtg vcfstats --allele-lengths which outputs a table of counts of the various categories, including SNPs/MNPs, e.g:
$ rtg vcfstats --allele-lengths /data/human/variant/CEPH_Platinum/SegregationPhasing/sp38.2.3/phasing_gold.vcf.gz --sample NA12877
Location : /data/human/variant/CEPH_Platinum/SegregationPhasing/sp38.2.3/phasing_gold.vcf.gz
Failed Filters : 0
Passed Filters : 5734934
Complex Called : 1418657
SNPs : 3574516
MNPs : 41391
Insertions : 292992
Deletions : 301417
Indels : 106419
Same as reference : 1418199
De Novo Genotypes : 3554
Phased Genotypes : 96.8% (5549794/5734934)
SNP Transitions/Transversions: 2.04 (3349131/1644010)
Total Haploid : 99774
Haploid SNPs : 78977
Haploid MNPs : 823
Haploid Insertions : 9050
Haploid Deletions : 8922
Haploid Indels : 2002
Total Het/Hom ratio : 1.52 (2544161/1672800)
SNP Het/Hom ratio : 1.47 (2079458/1416081)
MNP Het/Hom ratio : 1.28 (22753/17815)
Insertion Het/Hom ratio : 1.65 (176792/107150)
Deletion Het/Hom ratio : 1.76 (186531/105964)
Indel Het/Hom ratio : 3.05 (78627/25790)
Insertion/Deletion ratio : 0.97 (292992/301417)
Indel/SNP+MNP ratio : 0.19 (700828/3615907)
Variant Allele Lengths :
length SNP MNP Delete Insert Indel
1 5009895 0 220777 219050 41803
2 0 49499 74173 73594 22513
3 0 4759 27286 27180 8323
4 0 1261 45955 45296 12240
5 0 1237 11909 11623 3902
6 0 673 13321 12575 5902
7 0 903 3556 3132 1967
8 0 359 11004 10937 4513
9 0 520 3090 3010 1456
10 0 325 5956 5486 2851
11 0 322 1874 1630 930
12 0 224 6344 5829 2737
13 0 293 1456 1214 728
14 0 219 2675 2346 1493
15 0 172 1739 1663 715
16 0 173 2907 2735 1519
17 0 189 795 751 478
18 0 149 1678 1408 895
19 0 156 701 643 395
20 0 110 1886 1767 1100
21 0 167 592 562 374
22 0 108 943 776 616
23 0 93 398 414 225
24 0 85 1138 1073 675
25 0 110 490 462 307
26 0 80 556 460 448
27 0 98 323 332 188
28 0 75 665 591 414
29 0 84 261 235 160
30 0 48 483 411 318
31 0 73 196 178 154
32 0 49 467 341 323
[...]
You will have to do some simple parsing to extract the lengths table out. Note that this length table is specific to the genotypes in the selected sample and includes all the sample alleles, e.g. if the GT is 1/2 it'll increment the count for both of the alleles referenced, and a 1/1 will increment the allele twice. (Whereas the initial vcfstats output section keeps things simpler by giving each GT as a whole a single classification into one of the categories based on the most complex of the called alleles). In terms of the underlying allele classification used by vcfstats, there are some simple normalizations that go on (mostly so the results are sensible in the presence of multi-allelic records and/or multi-sample VCFs, and reference matching padding bases), e.g:
SNPs (single base change)
A -> G
ATGC -> ATCC
ATGC -> ACGC
MNPs (multiple bases change, but the length does not)
ATGC -> GGCC (length 3)
ATGC -> GTCT (length 4)
For the insertions/deletions/indels the table is based on the delta in length rather than total length (which really matters for the indels):
Insertions (pure addition of bases)
A -> AT (length 1 insertion)
ATT -> ATTTT (length 2 insertion)
Deletions (pure deletion of bases)
AT -> A (length 1 deletion)
ATTTT -> ATT (length 2 deletion)
Indels (length changing but not pure)
ATT -> CTTT (length 1 indel)
CTTT -> ATT (length 1 indel)
Cheers,
Len.