Findings: The highest performing machine and statistical learning methods included both rare variants and those known to be causal of resistance for at least one drug. Both simpler L2 penalized regression and complex machine learning models had high predictive performance. The average AUCs for our highest performing model was 0.979 for first-line drugs and 0.936 for second-line drugs during repeated cross-validation. On an independent validation set, the highest performing model showed average AUCs, sensitivities, and specificities, respectively, of 0.937, 87.9%, and 92.7% for first-line drugs and 0.891, 82.0% and 90.1% for second-line drugs. Our method outperforms existing approaches based on direct association, with increased sum of sensitivity and specificity of 11.7% on first line drugs and 3.2% on second line drugs. Our method has higher predictive performance compared to previously reported machine learning models during cross-validation, with higher AUCs for 8 of 10 drugs.
GMMAT is an R package for performing genetic association tests in genome-wide association studies (GWAS) and sequencing association studies, for outcomes with distribution in the exponential family (e.g. binary outcomes) based on generalized linear mixed models (GLMMs). It can be used to analyze genetic data from individuals with population structure and relatedness. GMMAT fits a GLMM with covariate adjustment and random effects to account for population structure and familial or cryptic relatedness. For GWAS, GMMAT performs score tests for each genetic variant. For candidate gene studies, GMMAT can also perform Wald tests to get the effect size estimate for each genetic variant. For rare variant analysis from sequencing association studies, GMMAT performs the variant Set Mixed Model Association Tests (SMMAT), including the burden test, the sequence kernel association test (SKAT), SKAT-O and an efficient hybrid test of the burden test and SKAT, based on user-defined variant sets. See user manual here.References:
Multiple computational approaches have been developed to improve our understanding of genetic variants. However, their ability to identify rare pathogenic variants from rare benign ones is still lacking. Using context annotations and deep learning methods, we present pathogenicity prediction models, MetaRNN and MetaRNN-indel, to help identify and prioritize rare nonsynonymous single nucleotide variants (nsSNVs) and non-frameshift insertion/deletions (nfINDELs). We use independent test sets to demonstrate that these new models outperform state-of-the-art competitors and achieve a more interpretable score distribution. Importantly, prediction scores from both models are comparable, enabling easy adoption of integrated genotype-phenotype association analysis methods. All pre-computed nsSNV scores are available at The stand-alone program is also available at -Li2019/MetaRNN.
Because experimentally validating the effects of these variants is highly time-consuming and costly, computational approaches have been developed for this purpose [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. These methods can be loosely categorized into three groups: functional prediction methods, which model the functional importance of the variants; conservation-based methods, which use evolutionary data to identify functional regions and variants; and ensemble methods, which combine multiple individual prediction tools into a single more powerful predictor. While these methods have been widely used to predict potentially pathogenic variants, there are still two significant limitations in their application to whole-exome sequencing studies. First, most of these methods either deployed models trained with rare pathogenic and common benign variants or ignored the importance of observed allele frequencies as features, leading to less optimized performance for separating rare pathogenic and rare benign variants. Second, most methods provide prediction scores for only nsSNVs or incomparable scores for nsSNVs and nfINDELs separately, making it infeasible to use these scores as weights in an integrated (nsSNV+nfINDELs) burden test for genotype-phenotype association analysis.
This study developed the MetaRNN and MetaRNN-indel models to overcome these limitations, enabling users to easily annotate and score both nsSNVs and nfINDELs. As predictive features, our classifiers combine recently developed independent prediction algorithms, conservation scores, and allele frequency information from the 1000 Genomes Project (1000GP) [19], ExAC [20], and gnomAD [21]. Annotations from flanking 1 codon of nucleotides around the target variants were extracted by bidirectional gated recurrent units [22] (GRUs). We trained our recurrent neural network (RNN) model with 26,517 nsSNVs (absent from at least one of the three population datasets, namely gnomAD, ExAC, and 1000GP) and 1981 nfINDELs reported in ClinVar [23] on or before 20190102. To evaluate the performance of the proposed models, we compared multiple state-of-the-art computational methods using independent test sets constructed from well-known variation-disease association databases, i.e., ClinVar [23] and HGMD [24], a TP53 functional mutation dataset [25], and a dataset of potential cancer driver variants [26]. Our results suggest that utilizing flanking region annotations helps boost model performance for separating rare pathogenic variants versus rare (and common) benign variants. In addition, we provide pre-computed MetaRNN scores for all possible human nsSNVs available at [27, 28]. A GitHub page for a stand-alone annotation software package for both nsSNVs and nfINDELs is available at -Li2019/MetaRNN [29].
We applied a recurrent neural network with gated recurrent units [22] (GRU) to extract and learn the context information around target variants (Fig. 1). Bayesian Hyperparameter Optimization [43] was used to determine the best-performing model structure from a wide range of model structures. Specifically, the input layer takes an 11 28 matrix as input for the MetaRNN and a 58 28 matrix for the MetaRNN-indel model. After the bidirectional GRU layer, the MetaRNN model cropped out the context information, and only the learned features for the target variant were kept. This setup can significantly reduce the number of parameters compared to keeping all context features in the subsequent dense layer. Following the same idea, for MetaRNN-indel, the output for the last bidirectional GRU layer only returns the prediction for the final locus (return_sequences parameter was set to false) to limit the number of possible parameters in the following dense layer. The output layer is composed of 1 neuron with a sigmoid activation to model our binary classification problem. A binary cross-entropy loss was used as the loss function, and the Adam optimizer [44] was used to update model parameters through backpropagation [45]. This process used 70% of the training data for model training and 30% of the training data for performance evaluation, so no test sets were used in this step. The Python packages sci-kit-learn ( -learn.org/stable/) [46] and TensorFlow 2.0 ( ) [47] were used to develop the models, and KerasTuner ( -team.github.io/keras-tuner/) [48] was adopted to apply Bayesian Hyperparameter Optimization. The search space for all the hyperparameters is shown in Additional file 1: Table S11. The models with the smallest validation log loss were used as our final models for nsSNV (MetaRNN) and nfINDEL (MetaRNN-indel).
The MetaRNN and MetaRNN-indel models used an ensemble method that combined 24 individual prediction scores and four allele frequency (AF) features from the 1000 Genomes Project (1000GP) [19], ExAC [20], and gnomAD [21]. As shown in Fig. 2a, most of the component conservation scores and ensemble scores showed moderate to strong correlations (correlation coefficient between 0.4 and 1). However, MutationTaster [10] and GenoCanyon [39] showed a weak correlation with all other features. Since most SNVs are not observed in multiple populations (AF=0), correlations between different AF features are also strong (>0.8). AF features showed a weak correlation with all other individual predictors, implying that previous annotation scores have not fully exploited such allele frequency information. This observation is also supported by the feature importance analysis (Fig. 2b). The most important feature is the VEST4 score, which was trained on rare pathogenic variants and common benign variants. The ExAC and gnomAD exome AFs were ranked as the second and third most important features, while AF information from the 1000 GP and gnomAD whole-genome sequencing studies were ranked fifth and sixth, respectively. This observation is in concordance with previous observations [6], highlighting the importance of population AF data in inferring the functional significance of nsSNVs. With the increasing availability of population-based studies, these new AF-based features can complement earlier developed functional annotation tools, such as VEST4.
Additionally, we explored the score distributions for nsSNVs and nfINDELs used in testing the respective MetaRNN models. A clear bimodal distribution was observed for both MetaRNN and MetaRNN-indel predictions (Fig. 5b). Based on a cutoff value of 0.5 as inherited by the sigmoid activation function used in the MetaRNN models, pathogenic nsSNVs and nfINDELs can be effectively separated from benign ones. To further check for compatibility of predictions from both models, defined as the trend that similar prediction scores convey a similar likelihood of being pathogenic, we constructed a combined dataset with 828 randomly sampled prediction scores from the RNTS by MetaRNN and 828 prediction scores from the ClinVar test set for nfINDELs by MetaRNN-indel. We hypothesize that if these two scores are compatible, the AUC calculated using the combined data will perform similarly to the individual AUCs. As shown in Additional file 2: Fig. S9, the combined predictions had an AUC equal to 0.9379, similar to those observed using predictions from individual models (MetaRNN AUC=0.9322, MetaRNN-indel AUC=0.9378). These observations have two implications. First, using a cutoff of 0.5 is in accordance with the interpretation of the scores as probabilities, where a score greater than 0.5 can be categorized as having a higher probability of being pathogenic and a score less than 0.5 can be categorized as having a higher probability of being benign. Second, with a shared cutoff value and similar distributions for nsSNV and nfINDEL scores across independent test sets, we show that predictions from our two models, namely, MetaRNN and MetaRNN-indel, are comparable. This feature can effectively help increase the power of genotype-phenotype association studies and related gene-set association analyses. It can also help fine-mapping the exact causal variants in coding sequences.
aa06259810