Reducing INDEL calling errors in whole-genome and exome sequencing data
Abstract:
Background: INDELs, especially those disrupting protein-coding regions of the genome,
have been strongly associated with human diseases. However, INDEL variant calling
remains error-prone, with errors driven by library preparation, sequencing biases, and
algorithm artifacts.
Methods: We characterized whole genome sequencing (WGS), whole exome
sequencing (WES), and PCR-free sequencing data from the same samples to investigate
the sources of INDEL errors. We also developed a classification scheme based on the
coverage and composition to rank high- and low-quality INDEL calls. We performed a
large-scale validation experiment on 600 loci and found that high-quality INDELs have
a substantially lower error rate than low-quality INDELs (7% vs. 51%).
Results: Simulation
and experimental data show that assembly-based callers are significantly more sensitive
and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent
with published data. The concordance of INDEL detection between WGS and WES is low
(52%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation
rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs
(85% vs. 54%), and WES misses many large INDELs. In addition, the concordance for
INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard
WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate
detection with Scalpel of heterozygous INDELs requires 1.2-fold higher coverage than
that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality
INDEL calls, and they are highly enriched in the WES data.
Conclusions: Overall, we show
that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted
region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed
to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing
practice, the deeper coverage may save total project costs because of the greater
accuracy and sensitivity. Finally, we investigate sources of INDEL errors (e.g., capture
deficiency, PCR amplification, homopolymers) using various data, which will serve as
a guideline to effectively reduce INDEL errors in genome sequencing.
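
As a rough illustration of the concordance figures reported above, the sketch below computes the overlap between two INDEL call sets. It assumes that concordance means exact matches (same chromosome, position, reference allele, and alternate allele) divided by the union of both call sets; the abstract does not state the definition used. The `Indel` type, the `concordance` function, and the toy calls are hypothetical and are not taken from the study.

```python
from typing import NamedTuple, Set


class Indel(NamedTuple):
    """A minimal INDEL record (hypothetical; mirrors the core VCF fields)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(calls_a: Set[Indel], calls_b: Set[Indel]) -> float:
    """Fraction of exact-match calls shared by both sets, relative to their union.

    Assumption: concordance = |A and B| / |A or B|. Other definitions
    (e.g. position-only matching) would give different values.
    """
    union = calls_a | calls_b
    if not union:
        return 0.0  # no calls in either set; define concordance as 0
    return len(calls_a & calls_b) / len(union)


# Toy data for illustration only, not real calls.
wgs_calls = {
    Indel("chr1", 1000, "AT", "A"),     # 1 bp deletion
    Indel("chr1", 2000, "G", "GTTT"),   # 3 bp insertion
    Indel("chr2", 500, "CAAAAAA", "C"), # 6 bp deletion, seen only in WGS
}
wes_calls = {
    Indel("chr1", 1000, "AT", "A"),
    Indel("chr1", 2000, "G", "GTTT"),
    Indel("chr3", 750, "T", "TA"),      # seen only in WES
}

shared = concordance(wgs_calls, wes_calls)
wgs_only = wgs_calls - wes_calls
wes_only = wes_calls - wgs_calls
print(f"concordance = {shared:.2f}, WGS-only = {len(wgs_only)}, WES-only = {len(wes_only)}")
```

The set differences `wgs_only` and `wes_only` correspond to the platform-specific calls whose validation rates the abstract compares; in practice the call sets would be loaded from normalized VCF files rather than written by hand.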