Distribution of Indel size from a vcf using RTG?

440 views
Skip to first unread message

msh...@stanford.edu

unread,
Oct 18, 2017, 4:16:18 PM10/18/17
to RTG Users

Hi Lenn,


I have been using RTG for both vcf comparison (vcfeval) and variant calling. I want to generate Distribution of Indel size from a vcf and plot it.

Is there a command/way to generate Distribution of Indel size from a vcf using RTG?


I can use bcftools’s vcfstats function and then plot it. However, RTG and bcftools do not completely agree on definition of indels (primarily Indels and MNPs, although they typically agree for Insertions and Deletions). Since I have used RTG for most of my analysis, I would like to stick to it. 


Thanks,

Shruti

Len Trigg

unread,
Oct 18, 2017, 5:34:27 PM10/18/17
to msh...@stanford.edu, RTG Users
Hi Shruti,

You have a couple of options. You can use rtg vcfstats --allele-lengths which outputs a table of counts of the various categories, including SNPs/MNPs, e.g:

$ rtg vcfstats --allele-lengths /data/human/variant/CEPH_Platinum/SegregationPhasing/sp38.2.3/phasing_gold.vcf.gz --sample NA12877
Location                     : /data/human/variant/CEPH_Platinum/SegregationPhasing/sp38.2.3/phasing_gold.vcf.gz
Failed Filters               : 0
Passed Filters               : 5734934
Complex Called               : 1418657
SNPs                         : 3574516
MNPs                         : 41391
Insertions                   : 292992
Deletions                    : 301417
Indels                       : 106419
Same as reference            : 1418199
De Novo Genotypes            : 3554
Phased Genotypes             : 96.8% (5549794/5734934)
SNP Transitions/Transversions: 2.04 (3349131/1644010)
Total Haploid                : 99774
Haploid SNPs                 : 78977
Haploid MNPs                 : 823
Haploid Insertions           : 9050
Haploid Deletions            : 8922
Haploid Indels               : 2002
Total Het/Hom ratio          : 1.52 (2544161/1672800)
SNP Het/Hom ratio            : 1.47 (2079458/1416081)
MNP Het/Hom ratio            : 1.28 (22753/17815)
Insertion Het/Hom ratio      : 1.65 (176792/107150)
Deletion Het/Hom ratio       : 1.76 (186531/105964)
Indel Het/Hom ratio          : 3.05 (78627/25790)
Insertion/Deletion ratio     : 0.97 (292992/301417)
Indel/SNP+MNP ratio          : 0.19 (700828/3615907)

Variant Allele Lengths :
length  SNP     MNP     Delete  Insert  Indel
1       5009895 0       220777  219050  41803
2       0       49499   74173   73594   22513
3       0       4759    27286   27180   8323
4       0       1261    45955   45296   12240
5       0       1237    11909   11623   3902
6       0       673     13321   12575   5902
7       0       903     3556    3132    1967
8       0       359     11004   10937   4513
9       0       520     3090    3010    1456
10      0       325     5956    5486    2851
11      0       322     1874    1630    930
12      0       224     6344    5829    2737
13      0       293     1456    1214    728
14      0       219     2675    2346    1493
15      0       172     1739    1663    715
16      0       173     2907    2735    1519
17      0       189     795     751     478
18      0       149     1678    1408    895
19      0       156     701     643     395
20      0       110     1886    1767    1100
21      0       167     592     562     374
22      0       108     943     776     616
23      0       93      398     414     225
24      0       85      1138    1073    675
25      0       110     490     462     307
26      0       80      556     460     448
27      0       98      323     332     188
28      0       75      665     591     414
29      0       84      261     235     160
30      0       48      483     411     318
31      0       73      196     178     154
32      0       49      467     341     323
[...]

You will have to do some simple parsing to extract the lengths table out. Note that this length table is specific to the genotypes in the selected sample and includes all the sample alleles, e.g. if the GT is 1/2 it'll increment the count for both of the alleles referenced, and a 1/1 will increment the allele twice. (Whereas the initial vcfstats output section keeps things simpler by giving each GT as a whole a single classification into one of the categories based on the most complex of the called alleles). In terms of the underlying allele classification used by vcfstats, there are some simple normalizations that go on (mostly so the results are sensible in the presence of multi-allelic records and/or multi-sample VCFs, and reference matching padding bases), e.g:

SNPs (single base change)
A -> G
ATGC -> ATCC
ATGC -> ACGC

MNPs (multiple bases change, but the length does not)
ATGC -> GGCC (length 3)
ATGC -> GTCT (length 4)

For the insertions/deletions/indels the table is based on the delta in length rather than total length (which really matters for the indels):

Insertions (pure addition of bases)
A -> AT (length 1 insertion)
ATT -> ATTTT (length 2 insertion) 

Deletions (pure deletion of bases)
AT -> A (length 1 deletion)
ATTTT -> ATT (length 2 deletion)

Indels (length changing but not pure)
ATT -> CTTT (length 1 indel)
CTTT -> ATT (length 1 indel)

The other option (particularly if you want something independent of the samples, e.g. based on just the REF/ALT declarations) is do compute your own indel length table, e.g using the javascript capability of rtg vcffilter. There is an example in the user manual that shows binned indel length distribution computation (see https://cdn.rawgit.com/RealTimeGenomics/rtg-core/master/installer/resources/core/RTGOperationsManual/rtg_command_reference.html#javascript-examples ), that you could adapt to your needs.


Cheers,
Len.






--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+...@realtimegenomics.com.
Visit this group at https://groups.google.com/a/realtimegenomics.com/group/rtg-users/.

msh...@stanford.edu

unread,
Oct 18, 2017, 8:20:42 PM10/18/17
to RTG Users, msh...@stanford.edu
Thanks Lenn for the detailed explanation. This is very helpful.

msh...@stanford.edu

unread,
Oct 8, 2018, 5:51:55 PM10/8/18
to RTG Users, msh...@stanford.edu

Hi Lenn,


Following up on this old conversation. I am working with a vcf with of indels of large sizes (1-20,000bp). Rtg’s vcfstats --allele-lengths command summarizes the number of indels of length greater than 100bp in intervals. For example:

length Delete

100-199 2337

200-299 905

300-399 1772

400-499 249

This is good to get a big picture. However, I am trying to generate graphs to show the distribution of size of indels (count vs size of indels). It will be helpful to get the exact frequency count for each size of indel in the vcf. How can achieve this?


Thanks,

Shruti

Len Trigg

unread,
Oct 8, 2018, 6:42:34 PM10/8/18
to msh...@stanford.edu, RTG Users

You can use the rtg vcffilter javascript capability to do the counting (check the link I sent previously), but here's a really quick example:

$ rtg vcffilter -i example-panel.vcf.gz --javascript 'function record() {ALT.map(function(alt){print(alt.length - REF.length)})}' | sort -n | uniq -c
      1 -16
      1 -12
      1 -8
      2 -7
      3 -4
      6 -3
      6 -2
     29 -1
    649 0
     31 1
      6 2
      2 3
      1 4
      4 5
      1 6
      3 8
      1 12
      1 14
      1 20

That simply outputs the difference in length between the REF and each ALT allele per record. It ignores any sample information (e.g. het vs hom, multiple samples, etc), but you can extend the javascript to do the counting however you want.

Cheers,
Len.



msh...@stanford.edu

unread,
Oct 10, 2018, 4:22:25 AM10/10/18
to RTG Users, msh...@stanford.edu
Thanks Len, it works. I appreciate your quick response.
Reply all
Reply to author
Forward
0 new messages