Does Delly have an upper limit for the size of the structural variants called?


Geoffrey Thomson

Nov 15, 2023, 1:36:53 PM

to delly-users

Hi Tobias

Thank you for delly2. It is very polished, and it is great that you maintain this forum.

I have some samples with a large number of structural variants, almost all of which are some sort of duplication. They are obvious when I plot the ratio of normalized coverage against a control sample, because large regions of the genome appear amplified.
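For context, the coverage ratio mentioned above can be sketched roughly as follows. This is only an illustration: the helper name `cov_ratio`, the three-column input layout, and the normalization by total read count are assumptions, and the counts are made up.

```shell
# Hypothetical helper: stdin has "bin sample_reads control_reads" per line;
# $1 and $2 are the total read counts of the sample and of the control.
# Prints each bin with the ratio of depth-normalized coverage
# (bins with zero control reads are skipped).
cov_ratio() {
  awk -v st="$1" -v ct="$2" \
    '$3 > 0 { printf "%s %.2f\n", $1, ($2 / st) / ($3 / ct) }'
}

# Made-up counts: bin2 has twice the normalized coverage of the control.
printf '%s\n' 'bin1 100 100' 'bin2 200 100' | cov_ratio 1000 1000
# bin1 1.00
# bin2 2.00
```

Bins with a ratio well above 1 across a long stretch are what show up as amplified regions in the plots.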

I have run delly2, and it appears the program finds the majority of the amplifications I can see (~84-87%). However, I have a few questions about the output and about the few amplified regions Delly did not identify.

I ran the program following the documentation on the GitHub page, with the final call looking something like this:

delly call -g genomePath -v dellyMergeOutput -o output.bcf -d SVreads.gz bamFile

1a) Does Delly have an upper limit for the size of the structural variants called?

I ask because I have a handful of large (>1 Mb) regions that are clearly amplified but that Delly does not detect (e.g. Fig a and b). I know there are edge cases that Delly would miss, but ideally they would not be the largest SVs, as these tend to stand out. Is there a hardcoded limit on the size of the SVs Delly detects and, if so, could I tweak it somehow?
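One way to see whether the calls show an apparent size ceiling is to sort them by span. A minimal sketch, assuming input lines of the form "ID POS END"; the helper name `sv_spans` is made up, and the three example rows are taken from the query output under question 2.

```shell
# Hypothetical helper: reads "ID POS END" lines on stdin and prints
# "ID span_bp", largest span first.
sv_spans() {
  awk '{ print $1, $3 - $2 }' | sort -k2,2nr
}

# Intended use on the merged sites file queried under question 2:
# bcftools query -f '%ID %POS %INFO/END\n' RQ14595_sites_merged.bcf | sv_spans | head

# Three of the calls listed in this post:
printf '%s\n' \
  'DUP00000016 2247090 2428855' \
  'DUP00000083 8890608 9340601' \
  'DEL00000087 9000541 9000590' | sv_spans
# DUP00000083 449993
# DUP00000016 181765
# DEL00000087 49
```

If the largest spans in the full call set all stop just below some value, that would hint at a limit; a smooth tail would suggest the missed regions are missed for another reason.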

1b) Delly does not appear to identify overlapping SVs well (e.g. Fig b). Is there a way to improve this?

In several of my SVs it looks like there are distinct amplification events overlapping one another in the same sample, likely on the same chromosome, as some regions within the SV have much greater coverage than others. In many cases Delly appears to recognize only the smaller event, as if, once it has identified one SV at a location, it does not consider that others may be present. Is this the case and, again, is there a parameter I can change?
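To check how often the call set already contains one call sitting entirely inside another, something like the following could be used. It is a rough sketch: the helper name `nested_pairs` is made up, the input is assumed to be "ID POS END" lines from a single chromosome, and the three example rows come from the query output under question 2.

```shell
# Hypothetical helper: reads "ID POS END" lines (one chromosome) on stdin
# and prints every call whose interval lies inside another call's interval.
nested_pairs() {
  awk '{ id[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0 }
       END {
         for (i = 1; i <= NR; i++) {
           for (j = 1; j <= NR; j++) {
             if (i != j && s[i] <= s[j] && e[j] <= e[i]) {
               print id[j], "inside", id[i]
             }
           }
         }
       }'
}

printf '%s\n' \
  'DUP00000083 8890608 9340601' \
  'DUP00000086 8975278 9069999' \
  'DEL00000087 9000541 9000590' | nested_pairs
# DUP00000086 inside DUP00000083
# DEL00000087 inside DUP00000083
# DEL00000087 inside DUP00000086
```

The pairwise comparison is quadratic in the number of calls, which is fine for a per-chromosome list of this size.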

1c) Does Delly require both junctions to be present to call an SV?

In a few cases I have amplified regions that lie next to regions with very poor mappability. An example is given in Fig c, where the junction on the left-hand side intersects a region that has a lot of split (SA-tagged) reads mapping to it. Thus, while I can see a peak of split reads at the left-hand side, many of them also map to various other sites in the genome with poor mappability. I know this is a hard problem, but do you have a solution?

2) In the output of Delly, what is the relationship between INFO/PE and FORMAT/RR? Likewise, between INFO/SR and FORMAT/RV?

My intuition is that they should be reporting the same thing: the number of reads supporting a variant call, from either mate pairs on either side of the junction or split reads that span the junction. Since PE and SR are each a single value, I expected them to represent one sample, maybe the first sample to have the variant. I understand that RR and DV are specific to each sample. However, I can't see any correspondence between the two metrics.

> bcftools query -i "INFO/SR>=10" -f'%CHROM %POS %INFO/END %ID %INFO/PRECISE %INFO/SVTYPE %FILTER %INFO/CT %INFO/PE %INFO/SR --- [%DR ] [%DV ] [%RR ] [%RV ]\n' RQ14595_sites_merged.bcf | head

Chr1 2247090 2428855 DUP00000016 1 DUP PASS 5to3 55 20 --- 0 0 0 0 0 0 44 57 0 0 130 115 95 103 96 0 116 155 0 0
Chr1 2294892 2296187 INV00000024 1 INV PASS 3to3 5 20 --- 1 1 0 0 1 9 7 10 5 3 132 209 100 145 150 54 41 150 86 60
Chr1 3247148 3427671 DUP00000041 1 DUP PASS 5to3 40 20 --- 0 0 0 0 0 0 33 44 0 0 121 134 111 111 117 0 97 139 0 0
Chr1 7485286 7727428 DUP00000057 1 DUP PASS 5to3 18 20 --- 0 0 0 0 0 0 19 0 0 0 95 98 79 82 75 0 63 0 0 0
Chr1 7864750 7867681 INV00000063 1 INV PASS 3to3 0 20 --- 0 0 0 0 0 0 0 0 0 0 138 133 141 142 157 19 9 22 38 37
Chr1 8603509 8604582 INV00000077 1 INV PASS 3to3 0 20 --- 0 0 0 0 0 0 0 0 0 0 153 120 115 127 127 72 30 57 79 76
Chr1 8603697 8604372 INV00000079 1 INV PASS 3to3 0 14 --- 0 0 0 0 0 0 0 0 0 0 135 100 121 126 105 25 16 37 55 30
Chr1 8890608 9340601 DUP00000083 1 DUP PASS 5to3 28 20 --- 0 0 1 0 0 0 0 33 0 0 153 92 102 79 94 0 0 78 0 0
Chr1 8975278 9069999 DUP00000086 1 DUP PASS 5to3 18 20 --- 0 0 0 0 0 0 18 0 0 0 136 99 221 90 84 0 35 0 0 0
Chr1 9000541 9000590 DEL00000087 1 DEL PASS 3to5 0 20 --- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 142 154 173 110 108
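The rows above are hard to read because the per-sample values are grouped by tag rather than by sample. A small sketch that lays one row out per sample, assuming five samples (20 values after the `---`) and the column order of the query above; the helper name `per_sample` is made up.

```shell
# Hypothetical helper: expects lines of the query output above on stdin and
# assumes 5 samples, so fields 12-31 are DR x5, DV x5, RR x5, RV x5.
per_sample() {
  awk -v n=5 '{
    printf "%s PE=%s SR=%s\n", $4, $9, $10
    for (i = 1; i <= n; i++) {
      printf "sample%d DR=%s DV=%s RR=%s RV=%s\n",
             i, $(11 + i), $(11 + n + i), $(11 + 2 * n + i), $(11 + 3 * n + i)
    }
  }'
}

# The header definitions of these tags can be printed with:
# bcftools view -h RQ14595_sites_merged.bcf | grep -E 'ID=(PE|SR|DR|DV|RR|RV),'

# First row of the output above:
printf '%s\n' 'Chr1 2247090 2428855 DUP00000016 1 DUP PASS 5to3 55 20 --- 0 0 0 0 0 0 44 57 0 0 130 115 95 103 96 0 116 155 0 0' | per_sample
# DUP00000016 PE=55 SR=20
# sample1 DR=0 DV=0 RR=130 RV=0
# sample2 DR=0 DV=44 RR=115 RV=116
# sample3 DR=0 DV=57 RR=95 RV=155
# sample4 DR=0 DV=0 RR=103 RV=0
# sample5 DR=0 DV=0 RR=96 RV=0
```

Laid out this way, the site-level PE and SR values can be compared directly with the DR/DV/RR/RV counts of each sample.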

3) What is the CNV calling supposed to do? I expected it to be an estimate of the number of duplications within a given SV. Also, since one of the inputs is the merged BCF file created from SV calling, I expected the CNVs to align with these identified regions. However, the CNVs found do not cover the entire span of the SVs; where they do overlap an SV they are fragmented, and sometimes they occur in locations not identified as SVs by Delly (Fig d). Is this correct?

Again, I followed the documentation on the GitHub page, with a final call looking something like:

delly cnv -u -v dellyMergeOutput -o output.bcf -g genomePath -m map.fa.gz -l mergedSVOutput bamFile
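To put a number on how fragmented the CNV calls are relative to an SV, the fraction of the SV span covered by CNV segments could be computed along these lines. This is a sketch only: the helper name `cnv_fraction` is made up, the CNV segments are assumed to lie on the same chromosome and not to overlap one another, and the coordinates in the example are invented.

```shell
# Hypothetical helper: $1 and $2 are the SV start and end; stdin has one CNV
# segment "start end" per line (same chromosome, non-overlapping segments).
# Prints the fraction of the SV span covered by CNV segments.
cnv_fraction() {
  awk -v s="$1" -v e="$2" '{
    lo = ($1 > s) ? $1 : s          # clip the segment to the SV interval
    hi = ($2 < e) ? $2 : e
    if (hi > lo) covered += hi - lo
  } END { printf "%.2f\n", covered / (e - s) }'
}

# Invented example: an SV at 1000-2000 and three CNV segments,
# one of which lies outside the SV.
printf '%s\n' '900 1200' '1500 1700' '3000 3500' | cnv_fraction 1000 2000
# 0.40
```

A value near 1 would mean the CNV calls tile the SV almost completely; low values correspond to the fragmented pattern in Fig d.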