Variants predicted to result in the loss of function of human genes have attracted interest because of their clinical impact and surprising prevalence in healthy individuals. Here, we present ALoFT (annotation of loss-of-function transcripts), a method to annotate and predict the disease-causing potential of loss-of-function variants. Using data from Mendelian disease-gene discovery projects, we show that ALoFT can distinguish between loss-of-function variants that are deleterious as heterozygotes and those causing disease only in the homozygous state. Investigation of variants discovered in healthy populations suggests that each individual carries at least two heterozygous premature stop alleles that could potentially lead to disease if present as homozygotes. When applied to de novo putative loss-of-function variants in autism-affected families, ALoFT distinguishes between deleterious variants in patients and benign variants in unaffected siblings. Finally, analysis of somatic variants in >6500 cancer exomes shows that putative loss-of-function variants predicted to be deleterious by ALoFT are enriched in known driver genes.
One of the most notable findings from personal genomics studies is that all individuals harbor loss-of-function (LoF) variants in some of their genes1. A systematic study of LoF variants from the 1000 Genomes Project revealed that there are over 100 putative LoF (pLoF) variants in each individual2,3,4. Recently, a larger study aimed at elucidating rare LoF events in 2636 Icelanders generated a catalog of 1171 genes that contain either homozygous or compound heterozygous LoF variants with a minor allele frequency less than 2%5. Thus, several genes are knocked out either completely or in an isoform-specific manner. The discovery of protective LoF variants associated with beneficial traits and their potential to enable identification of valuable drug targets has fueled an increased interest in pLoF variants. For example, nonsense variants in PCSK9 are associated with low low-density lipoprotein (LDL) levels6, which prompted the active pursuit of the inhibition of PCSK9 as a potential therapeutic for hypercholesterolemia7, 8 and led to the development of two drugs that have been recently approved by the FDA. Other examples include nonsense and splice mutations in APOC3 associated with low levels of circulating triglycerides, a nonsense mutation in SLC30A8 resulting in about 65% reduction in risk for Type II diabetes, two splice variants in the Finnish population in LPA that protect against coronary artery disease, and two LoF-producing splice variants and a nonsense mutation in HAL associated with increased blood histidine levels and reduced risk of coronary artery disease9,10,11,12.
About 12% of known disease-causing mutations in the Human Gene Mutation Database (HGMD) are due to nonsense mutations13. Even though premature stop variants often lead to loss of function and are thus deleterious, predicting the functional impact of premature stop codons is not straightforward. Aberrant transcripts containing premature stop codons are typically removed by nonsense-mediated decay (NMD), an mRNA surveillance mechanism14. However, a recent large-scale expression analysis demonstrated that 68% of predicted NMD events due to premature stop variants are unsupported by RNA-Seq analyses15. Moreover, premature stop codons in the last exon are generally not subject to NMD. A study aimed at understanding disease mutations using a 3D structure-based interaction network suggests that truncating mutations can give rise to functional protein products16. Furthermore, when a variant affects only some isoforms of a gene, it is difficult to infer its impact on gene function without the knowledge of the isoforms that are expressed in the tissue of interest and how their levels of expression affect gene function. Finally, loss of function of a gene might not have any impact on the fitness of the organism.
While there are several algorithms to predict the effect of missense coding variants on protein function, there is a paucity of methods that are applicable to nonsense variants17,18,19. Additionally, current prediction methods that infer the pathogenicity of variants do not take into account the zygosity of the variant20, 21. The majority of pLoF variants in healthy cohorts are heterozygous. It is likely that a subset of these variants will cause disease as homozygotes.
Here we present a pipeline called ALoFT (Annotation of Loss-of-Function Transcripts) that provides extensive annotation of pLoF variants. Furthermore, we developed a prediction model to classify pLoF variants into three classes: those that are benign, those that lead to recessive disease (disease-causing only when homozygous), and those that lead to dominant disease (disease-causing as heterozygotes). Finally, we validated the prediction model by applying ALoFT to known disease mutations in Mendelian diseases, autism, and cancer.
Using the annotations output by ALoFT as predictive features (Fig. 1, Supplementary Data 1), we developed a prediction method to infer the pathogenicity of pLoF variants. To build the ALoFT classifier, we used three classes of premature stop variants as training data: benign variants, dominant disease-causing variants, and recessive disease-causing variants (Supplementary Table 2). The benign set includes homozygous premature stop variants discovered in a cohort of 1092 healthy people, the Phase 1 1000 Genomes data (1KG). Homozygous premature stop mutations from HGMD that lead to recessive disease and heterozygous premature stop variants in haploinsufficient genes that lead to dominant disease represent the two disease classes3, 28. In addition to loss-of-function effects, truncating mutations can also lead to gain of function. However, gain-of-function mutations are difficult to model systematically, as the effect of a variant can only be understood in the context of the biology of the gene and can vary widely across genes and gene classes. To minimize errors that might arise from inadequate modeling of gain-of-function effects and to focus on LoF, we use only predicted haploinsufficient genes as the training data for the dominant model. We built the ALoFT classifier to distinguish among the three classes using a random forest algorithm29 (details in Methods). For each pLoF variant, ALoFT outputs three class probability estimates, corresponding to the probability of the variant being a benign, dominant disease-causing, or recessive disease-causing allele, and we obtain good discrimination among the classes. ALoFT also reports the predicted pathogenicity: the pathogenic effect of a pLoF variant is assigned to the class with the maximum score.
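The maximum-score assignment described above can be illustrated with a minimal Python sketch. It is not ALoFT's own code: the function name `assign_class`, the dictionary layout, and the example scores are hypothetical, and the tie-breaking order (first class listed wins) is an assumption, since the text does not say how ties are resolved.

```python
from typing import Dict, Tuple

# Class labels as described in the text; the order is used only to break ties
# (an assumption, not something the paper specifies).
CLASSES = ("benign", "recessive", "dominant")


def assign_class(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the class with the maximum score, together with that score."""
    missing = [c for c in CLASSES if c not in scores]
    if missing:
        raise ValueError(f"missing class scores: {missing}")
    label = max(CLASSES, key=lambda c: scores[c])
    return label, scores[label]


# Hypothetical scores for a single pLoF variant (illustrative values only).
example = {"benign": 0.10, "recessive": 0.25, "dominant": 0.65}
label, score = assign_class(example)
print(label, score)
```

Keeping the three scores alongside the assigned label, rather than the label alone, lets downstream analyses apply a stricter threshold than the simple maximum.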
Schematic workflow. ALoFT uses a VCF file as input and annotates premature stop, frameshift-causing indel, and canonical splice-site mutations with functional, conservation, and network features. ALoFT also flags potential mismapping and annotation errors. Using the annotation features, ALoFT predicts the pathogenicity (as either benign, recessive, or dominant disease-causing) of premature stop and frameshift mutations based on a model trained on known data. ALoFT can also take as input a five-column tab-delimited file containing chromosome, position, variant ID, reference allele, and alternate allele as its columns.
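As an illustration of the five-column tab-delimited input described in the caption, the following Python sketch reads such a file into typed records. The `Variant` type and `read_variants` function are hypothetical names, not part of ALoFT, and skipping blank lines and lines starting with `#` is an assumption about how comments might be handled. The sample rows are made-up values.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Variant(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def read_variants(handle: TextIO) -> Iterator[Variant]:
    """Yield one Variant per five-column, tab-delimited row."""
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        # Assumption: blank lines and '#' comment lines are ignored.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(row)}")
        chrom, pos, variant_id, ref, alt = row
        yield Variant(chrom, int(pos), variant_id, ref, alt)


# Made-up example rows: chromosome, position, variant ID, reference, alternate.
sample = "1\t100\tvar1\tC\tT\nX\t200\tvar2\tG\tA\n"
for variant in read_variants(io.StringIO(sample)):
    print(variant)
```

Validating the column count up front makes malformed rows fail loudly instead of being annotated with shifted fields.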
The classifier is robust to the choice of training data sets (Supplementary Table 3, details in Methods). Though trained with premature stop SNVs, our method is also applicable to frameshift indels. We applied ALoFT to classify pathogenic indels in HGMD. 99.4% of HGMD disease-causing frameshift indels are predicted to be pathogenic based on the maximum ALoFT score.
We evaluated ALoFT by predicting the effect of known disease-causing premature stop mutations from ClinVar31 (details in Methods) and predicted the mode of inheritance and pathogenicity of all of the truncating variants (Fig. 2a). ALoFT is clearly able to distinguish between pLoFs that lead to disease in a heterozygous state and those that do so only in a homozygous state. Heterozygous disease-causing variants have significantly higher dominant disease-causing scores than homozygous disease-causing variants (p-value: 1.3e-13; Wilcoxon rank-sum test). For comparison, we used two other measures to classify recessive vs. dominant pLoF variants: the GERP score, a measure of evolutionary conservation, and the CADD score, a measure of pathogenicity32. Neither CADD (p-value: 0.13; Wilcoxon rank-sum test) nor GERP (p-value: 0.49; Wilcoxon rank-sum test) scores are able to discriminate between recessive and dominant disease-causing mutations (Fig. 2a). We also tested our method on a smaller data set from the Center for Mendelian Genomics studies33 and were able to correctly recapitulate the pathogenic effect of pLoF variants and their inheritance pattern (Fig. 2b).
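For readers unfamiliar with the Wilcoxon rank-sum test used in these comparisons, the sketch below shows one way to compute it in plain Python using the normal approximation with a tie correction and a two-sided p-value. This is an illustrative implementation under those stated assumptions, not the code used in the study, and the two score lists are made-up values rather than data from the paper.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (z, p). A positive z means values in x tend to rank higher than y.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the average of ranks i+1..j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        raise ValueError("all values are identical; the test is undefined")
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return z, p


# Made-up dominant scores for two groups of variants (illustrative only).
heterozygous_scores = [0.90, 0.80, 0.85, 0.95, 0.70]
homozygous_scores = [0.10, 0.20, 0.30, 0.15, 0.25]
z, p = rank_sum_test(heterozygous_scores, homozygous_scores)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Because the test depends only on ranks, it makes no assumption that the scores are normally distributed, which suits bounded probability-like scores. For small samples an exact test would be preferable to the normal approximation used here.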
ALoFT classification of de novo premature stop variants from autism studies. a The top two panels show the ALoFT dominant scores of de novo premature stop mutations in autism patients and siblings; mutations in patients are further separated by gender, as shown in the bottom two panels. b ALoFT dominant prediction scores for autism de novo pLoFs in confident risk genes. In this plot, the center line represents the median value of the data, and the box spans the first quartile (Q1) to the third quartile (Q3). The lower whisker extends from Q1 to the smallest non-outlier in the data set, and the upper whisker extends from Q3 to the largest non-outlier in the data set.