HDX mass spectrometry is a powerful platform for probing protein structure dynamics during ligand binding, protein folding, enzyme catalysis, and related processes. The analysis derives protein structure dynamics from the mass increase of a protein whose backbone protons have exchanged with solvent deuterium. Coupled with enzyme digestion and MS/MS analysis, HDX mass spectrometry can be used to study the regional dynamics of a protein based on the m/z value or the percentage of deuterium incorporation of the digested peptides in the HDX experiments. Various software packages have been developed to analyze HDX mass spectrometry data. Despite this progress, proper and explicit statistical treatment is still lacking in most current HDX mass spectrometry software. To address this issue, we have developed HDXanalyzer for the statistical analysis of HDX mass spectrometry data using R, Python, and RPY2.
Statistical analysis provides a crucial evaluation of whether a protein region is significantly protected or unprotected in HDX mass spectrometry studies. Although several other software programs are available to process HDX experimental data, HDXanalyzer is the first to offer multiple statistical methods for evaluating changes in protein structure dynamics based on HDX mass spectrometry analysis. Moreover, the statistical analysis can be carried out on both the m/z value and the deuterium incorporation rate. In addition, the software package can be used with data generated by a wide range of mass spectrometry instruments.
The integration of statistical analysis with data processing is challenging. For statistical analysis, several software environments, including SAS, SPSS, and R, can be employed. Among these, R is open source and can be obtained from the internet free of charge. Despite its various advantages, R does not have strong user-interface support and thus requires a certain level of expertise. To develop user-friendly statistical software for HDX mass spectrometry analysis, we employ the latest RPY2 package to connect the statistical module of R with a data processing module implemented in Python and a user interface implemented in wxPython. Many programming languages, including C, Java, Perl, and Python, can be used for data processing and UI development, and each has its pros and cons. Python was chosen as the main development language for two reasons. First, the existing RPY package allows seamless and effective integration between the data processing module and the statistical module in R. Second, various BioPython packages have been developed for the analysis of biological data, which allows fast and easy embedding. For these reasons, we have developed the HDX mass spectrometry analysis software HDXanalyzer using Python, R, and the RPY2 package.
As shown in Figure 1, HDXanalyzer includes several modules: a data processor, a statistical analysis module, and a user interface. HDX mass spectrometry usually involves two types of data: the MS/MS analysis for peptide sequence identification, and the MS analysis for the m/z values of the peptide peaks in different protein forms and states (i.e., apo form, ligand-bound form, and proteins subjected to different HDX exchange time points). The MS/MS analysis allows the peptide ID to be correlated with a sequence, which is beyond the scope of HDXanalyzer; HDXanalyzer deals mainly with the MS analysis data. The raw MS data are exported as intensity and m/z values in Excel format for each peptide, as shown in the online Additional Files available at syuan/hdxanalyzer. The Additional Files contain the sample input data derived from HDX mass spectrometry analysis of xylanase enzyme and the executable file for the HDXanalyzer software, provided as a compressed package at the aforementioned link. HDXanalyzer then takes as input all the pre-formatted Excel files containing the peptide ID and the m/z values of the peptides at different HDX exchange time points. xlrd, a Python package for reading and formatting information from Excel files, was used to develop a parser that extracts and processes Excel spreadsheets. The parser extracts all m/z values and intensities of the peptides, together with the peptide ID and the treatment (apo or ligand bound).
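The grouping step performed after parsing can be illustrated with a minimal sketch. It assumes each spreadsheet row has already been read into a tuple of (peptide ID, treatment, time point, m/z, intensity); this column order and the function name group_peaks are hypothetical and are not taken from the HDXanalyzer source or its input format.

```python
from collections import defaultdict


def group_peaks(rows):
    """Group peak rows by (peptide ID, treatment, time point).

    Each row is assumed to be (peptide_id, treatment, time_point, mz, intensity);
    this layout is illustrative only.
    """
    grouped = defaultdict(list)
    for peptide_id, treatment, time_point, mz, intensity in rows:
        key = (str(peptide_id), str(treatment), float(time_point))
        grouped[key].append((float(mz), float(intensity)))
    return dict(grouped)


# Made-up example rows for one peptide in two treatments.
rows = [
    ("P1", "apo", 30, 500.10, 1200.0),
    ("P1", "apo", 30, 500.60, 800.0),
    ("P1", "ligand", 30, 500.10, 1500.0),
]
peaks = group_peaks(rows)
```

Keying the peaks by peptide, treatment, and time point keeps each isotopic cluster together, so that a centroid can later be computed per key.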
The data processing derives two types of variables for the statistical analysis: the centroid of the peptide as an m/z value, and the deuterium incorporation rate. The centroid of each peptide is the weighted average m/z value of the peptide ion isotopic cluster, derived from the m/z values of the peaks generated by MS analysis. The deuterium incorporation rate is calculated from the centroid of the peptide m/z value and is expressed as a percentage. The deuteration level of each peptide is calculated based on Equation 1, and corrections for back-exchange are made based on 70% deuterium recovery and accounting for 80% deuterium content in the ion-exchange buffer. These corrections can be defined by users in the data pre-processing procedure.
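The two derived variables can be sketched as follows. Equation 1 is not reproduced in this text, so the deuteration formula below is an assumed, commonly used form (mass shift divided by the number of exchangeable amide hydrogens, scaled by the recovery and deuterium-content corrections) and may differ from the authors' equation. The function names and the parameters charge and n_exchangeable are hypothetical; only the default values of 70% and 80% come from the text.

```python
def centroid(peaks):
    """Intensity-weighted average m/z of an isotopic cluster of (mz, intensity) pairs."""
    total_intensity = sum(intensity for _, intensity in peaks)
    if total_intensity <= 0:
        raise ValueError("total intensity must be positive")
    return sum(mz * intensity for mz, intensity in peaks) / total_intensity


def deuterium_incorporation(centroid_t, centroid_0, charge, n_exchangeable,
                            recovery=0.70, d_content=0.80):
    """Percent deuteration under an ASSUMED formula (not the paper's Equation 1).

    centroid_t / centroid_0: centroids of the labeled and unlabeled peptide.
    charge: charge state, used to convert the m/z shift into a mass shift.
    n_exchangeable: number of exchangeable backbone amide hydrogens.
    """
    mass_shift = (centroid_t - centroid_0) * charge
    return 100.0 * mass_shift / (n_exchangeable * recovery * d_content)


# Made-up numbers for illustration.
c = centroid([(500.0, 1.0), (501.0, 3.0)])          # 500.75
pct = deuterium_incorporation(501.0, 500.0, 2, 10)  # about 35.7
```

Exposing the two corrections as parameters mirrors the statement that users can define them during data pre-processing.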
The resulting data are then processed into a table format and loaded into the Figure Generator to visualize the dynamic status of each peptide at different time points. Specifically, the figures are graphs displaying the m/z value or deuterium incorporation rate of a peptide at different exchange times. Gnuplot, an open source plotting tool for UNIX/Linux with counterparts for MS-DOS and Windows, was employed to implement the Figure Generator. Gnuplot has two advantages: it is available free of charge from the internet, and its scripting language makes it convenient to generate multiple outputs automatically. Besides the graphic display of the HDX data, statistical analysis is carried out to generate the point estimate of the difference in m/z value or incorporation rate, the confidence intervals, and the p value.
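Automated figure generation through Gnuplot's scripting language might look like the following sketch, which builds one script per peptide as a string. The data file layout (column 1 for exchange time, column 2 for apo, column 3 for ligand bound), the file names, and the function name gnuplot_script are assumptions made for illustration and are not taken from HDXanalyzer.

```python
def gnuplot_script(peptide_id, data_file, output_png, ylabel="Centroid m/z"):
    """Return a Gnuplot script that plots apo and ligand-bound curves for one peptide.

    Assumes data_file has whitespace-separated columns: time, apo value, ligand value.
    """
    lines = [
        "set terminal png size 640,480",
        f"set output '{output_png}'",
        f"set title 'Peptide {peptide_id}'",
        "set xlabel 'Exchange time'",
        f"set ylabel '{ylabel}'",
        # An empty file name ('') tells Gnuplot to reuse the previous data file.
        f"plot '{data_file}' using 1:2 with linespoints title 'apo', "
        "'' using 1:3 with linespoints title 'ligand'",
    ]
    return "\n".join(lines) + "\n"


script = gnuplot_script("P1", "P1.dat", "P1.png")
```

Writing each script to disk and passing it to the gnuplot executable would then produce one image per peptide without manual interaction.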
Statistical analysis is employed to evaluate whether a peptide or a region of the protein shows significant changes in structure dynamics. Such changes are reflected in differences in either the centroid m/z values or the deuterium incorporation rates during the HDX experiments. The m/z values or deuterium incorporation rates of different peptides can be compared using different statistical models to derive parameter estimates and p values. The parameter estimates allow us to evaluate the levels and variations of the differences in structure dynamics of a protein region, and the p value allows us to determine whether the differences are significant.
In the statistical model, Y is the dependent variable, which can be either the m/z value or the deuterium incorporation rate of different peptides. Y depends on the effect of the time point and the effect of the group (apo or ligand-bound protein). The combination of the two effects may also influence the dependent variable.
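The model equation itself is not reproduced in this text. As a hedged illustration, the description is consistent with a two-way factorial model in which Y is the sum of an overall mean, a time effect, a group effect, a time-by-group interaction, and an error term. The sketch below computes the F statistics of such a model for a balanced design in plain Python. The function name two_way_anova and the data are hypothetical, and p values would still have to be obtained from the F distribution, for example in R.

```python
def two_way_anova(cells):
    """F statistics for a balanced two-way design with replication.

    cells maps (time, group) to a list of replicate Y values; every cell must
    hold the same number (at least 2) of replicates.
    """
    times = sorted({t for t, _ in cells})
    groups = sorted({g for _, g in cells})
    a, b = len(times), len(groups)
    n = len(next(iter(cells.values())))
    if len(cells) != a * b or n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("design must be complete and balanced with >= 2 replicates")

    def mean(xs):
        return sum(xs) / len(xs)

    grand = mean([y for ys in cells.values() for y in ys])
    cell_mean = {k: mean(v) for k, v in cells.items()}
    time_mean = {t: mean([cell_mean[(t, g)] for g in groups]) for t in times}
    group_mean = {g: mean([cell_mean[(t, g)] for t in times]) for g in groups}

    # Sums of squares for the two main effects, the interaction, and the error.
    ss_time = b * n * sum((time_mean[t] - grand) ** 2 for t in times)
    ss_group = a * n * sum((group_mean[g] - grand) ** 2 for g in groups)
    ss_inter = n * sum(
        (cell_mean[(t, g)] - time_mean[t] - group_mean[g] + grand) ** 2
        for t in times for g in groups
    )
    ss_error = sum((y - cell_mean[k]) ** 2 for k, ys in cells.items() for y in ys)
    ms_error = ss_error / (a * b * (n - 1))

    return {
        "time": (ss_time / (a - 1)) / ms_error,
        "group": (ss_group / (b - 1)) / ms_error,
        "interaction": (ss_inter / ((a - 1) * (b - 1))) / ms_error,
    }


# Made-up values: two time points, two groups, two replicates per cell.
cells = {
    (10, "apo"): [1.0, 1.2],
    (10, "ligand"): [0.8, 1.0],
    (60, "apo"): [2.0, 2.2],
    (60, "ligand"): [1.2, 1.4],
}
f_stats = two_way_anova(cells)
```

A large interaction F statistic would indicate that the difference between the apo and ligand-bound groups changes with exchange time.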
The integration of statistical analysis, data processing, and visualization is usually challenging. The recently developed RPY allows us to integrate the statistical features of R with the user interface and data processing features of Python. As an open source language, R has unique advantages over other statistical languages for software development. RPY enables us to employ R for the statistical analysis of HDX mass spectrometry data. We have also used RPY2 to provide a low-level interface to R. The Python-based system can thus call R functions directly through RPY, which greatly improves the efficiency and effectiveness of the software.
The user interface of HDXanalyzer is developed with wxPython, as shown in Figures 1 and 2. wxPython is a Python extension module that wraps the cross-platform GUI API wxWidgets for the Python programming language. The user interface includes a menu bar, a tool bar, and four windows: the data manager window, the figure browser, the enlarged figure window, and the statistical analysis window. The data manager window shows the spreadsheets (or the peptides) for the data analysis. The figure browser window shows a list of graphs comparing the m/z value or deuterium incorporation rate for all peptides listed in the data manager window. All of the graphs are clickable and can be viewed in the enlarged figure window. The statistical analysis window displays the statistical analysis results.
HDXanalyzer is implemented as a software package to enable the statistical analysis of HDX mass spectrometry data and to allow the evaluation of changes in protein structure dynamics. To demonstrate the software, we first analyze a previously published dataset from the HDX mass spectrometry analysis of xylanase enzyme. The example data are available in the online supplementary document. We also apply HDXanalyzer to two peptides from a recent publication, in which the HD exchange of the two peptides was statistically evaluated. We discuss the usage of the software, present the output, compare the results from different statistical models, and interpret the results.
HDXanalyzer aims to integrate statistical analysis for comparing the structure dynamics of a protein upon ligand or substrate binding. As discussed in the Implementation section, the software takes a batch of pre-formatted Excel files containing m/z values for multiple peptides under different treatments and at different time points, as shown in Supplementary File 1 available online (the HDX_Xylohexaose.rar dataset). The data pre-formatting allows the software to process a uniform input of HDX mass spectrometry data from different instruments. The sample input file is derived from a xylanase structure dynamics study, and the m/z values of the peak area for the peptides are included. Each input Excel file contains several sheets holding the data from different peptides and treatments. Each spreadsheet contains the peptide ID, m/z value, charge state, time points of deuterium treatment, and the ligand name used to separate different experimental sets, e.g., the apo set and the ligand set. The peptide ID corresponds to a certain peptide sequence. The upload function is available from the user interface, where the input file is read and processed to generate the m/z centroids and deuterium incorporation rates of the peptides as described above. The data are then further processed for visualization and statistical analysis.