Project ideas: lipidomics?


Nick Nuar

Sep 14, 2011, 9:00:55 PM

to MLSP Fall 2011

Would you be interested in this?
My boss, Judith Klein, suggested the following lipidomics machine
learning problem as a good class project. We have examples of normal
cells and of cells in which different diseases have been induced.
These cells are chopped up (lysed) and the fatty parts (lipids) are
fed into a mass spectrometer, which finds thousands of lipid chemicals
in each sample. Our task is to extract salient features and then
construct a classifier that can reliably detect the different disease
states. This is basic research, and this type of task is expected to
lead toward diagnosis of various human diseases involving lipid
signaling and metabolism. Our goal for the project would be a modest
demonstration of the technique; the full breadth of the subject will
take significantly more than 10 weeks to comprehend.
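
To make the "extract features" half of the task concrete, here is a
minimal sketch of turning per-sample peak lists into a fixed-width
feature matrix by binning m/z values. The function name, the bin width
and the example peaks are all made up for illustration; this is not
what MZmine or our in-house scripts do.

```python
from collections import defaultdict


def build_feature_matrix(samples, bin_width=0.01):
    """Bin peaks by m/z so every sample gets the same feature columns.

    samples: list of peak lists, each a list of (mz, intensity) pairs.
    Returns (bin_centers, matrix) with one row per sample.
    bin_width is an assumed tolerance, not a recommended value.
    """
    binned_rows = []
    all_bins = set()
    for peaks in samples:
        row = defaultdict(float)
        for mz, intensity in peaks:
            # Peaks falling in the same bin are summed into one feature.
            row[round(mz / bin_width)] += intensity
        binned_rows.append(row)
        all_bins.update(row)
    columns = sorted(all_bins)
    # A lipid missing from a sample gets intensity 0.0 in that column.
    matrix = [[row.get(b, 0.0) for b in columns] for row in binned_rows]
    return [b * bin_width for b in columns], matrix


# Toy peak lists (invented values) for two samples.
toy_samples = [
    [(1447.96, 100.0), (1449.98, 50.0)],
    [(1447.962, 80.0)],
]
toy_centers, toy_matrix = build_feature_matrix(toy_samples)
```

Fixed bins can split a peak that sits near a bin edge, so a real
pipeline would more likely rely on proper alignment of mass and
retention time. Each row of the matrix is what a classifier would see
as one sample.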

Here is a broader task than the class project for reference:
MS data processing. MS data processing will involve two stages, a
preprocessing step and a subsequent statistical analysis step [1]. For
step 1: preprocessing, several lipid-specific databases and software
packages for processing raw lipidomics data have been developed, such
as LipidView™, MarkerView™, MZmine and others [2]. The raw Xcalibur
output data will be processed with MZmine software [3], an
open source Java-based data processing tool for the molecular
characterization and quantification of lipid species from ESI-MS
data. The LIPID MAPS database [4] will additionally be used for
identification of CL molecular species. MZmine implements several key
methods for data processing including spectral filtering, peak
picking, isotope correction, alignment of samples, quantification and
data export. First we will perform chromatographic and spectral peak
picking to find CL peaks in complex samples. We will normalize the
data via appropriate internal standards. We will correct for the
contribution of the second isotope peak [M+2] of one CL molecular
species to the peak area of the CL that is two mass units higher, that
is, a CL with one fewer double bond. We will align mass and retention time to
compensate for minor variation, ensuring that identical CL molecular
species in different samples are accurately compared. We have already
implemented a number of in-house scripts to automate typical
workflows for CL-related data analysis. MZmine automatically
identifies masses and outputs them to an Excel sheet, from which it is
challenging to pick out those of particular interest, for example
CL-related species. We therefore wrote GUI modules in Python to
automatically process the output from MZmine. The
processed data will form the input to step 2: statistical analysis, in
which the comparison between datasets will be performed. Due to the
complexity and high dimensionality of the data, multivariate
statistical methods have found wide-spread use in the analysis of
lipidomics data. We will use the DanteR software suite [5] to perform
principal component analysis (PCA), partial least-squares discriminant
analysis (PLS-DA) and analysis of variance (ANOVA). All of these and
related approaches have been successfully applied to lipidomics
biomarker discovery [6, 7]. These methods will be applied to datasets
obtained first from SH-SY5Y and primary rat cortical neuron cells with
and without treatment with rotenone. We will use these datasets to
test and optimize the analysis to identify the lipid species that most
significantly differentiate between rotenone-treated and untreated
cells (Aim 1). The optimized approach will then be applied to the data
obtained from the rotenone-infusion rat model (Aim 3), and finally to
human peripheral blood lymphocytes (Aim 4). The deliverable of the
statistical analysis will be a list of candidate lipid biomarkers that
best discriminate rotenone-treated from untreated conditions. Finally,
we will conduct post-data analysis studies aimed at integrating the
lipidomic analysis with genomic, proteomic and metabolomics
information to derive and validate hypotheses related to the
underlying mechanism for alterations in lipid species, by using a
number of knowledge databases useful to generate an integrated picture
involving the related signal transduction pathways [8]. For example,
it was shown that biomarkers linked to radiation damage in rats are
related to cell apoptosis pathways [7]. This approach will support the
efforts described in Aim 2.
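
As an illustration of the [M+2] isotope correction mentioned in the
preprocessing step, here is a rough sketch. It assumes that only
carbon-13 contributes to the [M+2] peak (other isotopes such as 18O
are ignored), and that all species in the series share the same carbon
count. The function names and the numbers are hypothetical.

```python
C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def m_plus_2_ratio(n_carbons, p=C13_ABUNDANCE):
    """Ratio of the [M+2] peak to the monoisotopic peak (13C only).

    Binomial: P(two 13C) / P(zero 13C) = C(n, 2) * (p / (1 - p)) ** 2.
    """
    q = p / (1.0 - p)
    return n_carbons * (n_carbons - 1) / 2.0 * q * q


def correct_series(observed, n_carbons):
    """Remove [M+2] spill-over within a series of related species.

    observed: dict mapping number of double bonds -> peak intensity.
    The species with k + 1 double bonds is two mass units lighter than
    the one with k, so its [M+2] peak lands on the species with k.
    """
    ratio = m_plus_2_ratio(n_carbons)
    corrected = {}
    # Start with the lightest species (most double bonds) and move up,
    # so each spill-over uses an already-corrected intensity.
    for db in sorted(observed, reverse=True):
        spill = ratio * corrected.get(db + 1, 0.0)
        corrected[db] = max(observed[db] - spill, 0.0)
    return corrected


# Hypothetical series: 81 carbons, species with 8 and 7 double bonds.
toy_corrected = correct_series({8: 1000.0, 7: 500.0}, n_carbons=81)
```

For a molecule with around 80 carbons the 13C-only ratio already
comes out above one third, which is why the correction matters for
large lipids.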
Potential pitfalls and alternatives. The identification of candidate
biomarkers from lipidomics datasets is an important task in a number
of diseases, and successful candidates have been found, for example,
in cancer, diabetes, Alzheimer’s and other diseases [8]. Thus, it is
likely that the proposed approach will result in a list of biomarker
candidates that can be validated experimentally. However, often the
most successful approaches are those that incorporate prior knowledge
and expertise [8]. Thus, if discrimination of rotenone-treated and
untreated conditions proves difficult, we may take a more complex
approach, in which we investigate modifications to both steps 1 and 2,
described above. There are many different ways in which the processing
of the raw data can be improved, for example through signal processing
techniques such as noise reduction through wavelet transforms. We have
used such techniques previously in sequence analysis [9] and they have
been successfully applied to MS data processing [1]. The MS data can
be converted into feature vectors that form the input into step 2 in
numerous ways, including raw m/z values, identities of molecular
species, peak-related parameters, etc. Data integration from available
lipidomics databases will provide us with additional features such as
pathway membership, molecular family, etc. These features and different
ways of encoding them can be explored to improve performance. In step
2, there are numerous other statistical approaches that we can test.
In particular, we can formulate the task as a binary classification
task in which the label is “rotenone-treated” or “not”. We will
explore different classification algorithms, including Naïve Bayes,
decision trees, support vector machines (with different kernels), and
random forests, to optimize predictive performance. We have
extensive experience in integrating diverse and large biological
datasets for classification tasks (see list of publications, e.g.
[10]). Finally, in case the feature space is too large and overfitting
is problematic (a known issue in proteomics data analysis [1]), the
above approaches can also be combined with feature selection, which
can significantly improve performance in complex biological prediction
tasks, as in our previous study of protein family classification [11].
It should also be pointed out that, while we expect similarities
between different sources of material, in principle the analysis of
each dataset is independent of the others. Thus, biomarkers in cell
lines may be different from biomarkers in rats, and the methods do
not depend on whether or not this is true. However, we expect that we
can learn much from the analysis of the cell lines in Aim 1 that will
inform and speed up progress in the analysis of the datasets from Aims
2 and 3.
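
To show how feature selection can be combined with a classifier
without making overfitting worse, here is a small sketch in which the
selection step is redone inside every leave-one-out fold. A
nearest-centroid classifier stands in for the algorithms named above
purely to keep the example dependency-free; the function names and the
toy data are assumptions, not part of the proposal.

```python
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch-style t statistic for one feature, two groups."""
    diff = abs(mean(a) - mean(b))
    denom = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if denom == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / denom


def select_features(X, y, k):
    """Return indices of the k features with the largest t score."""
    treated = [row for row, label in zip(X, y) if label == 1]
    control = [row for row, label in zip(X, y) if label == 0]
    scored = []
    for j in range(len(X[0])):
        score = t_score([r[j] for r in treated], [r[j] for r in control])
        scored.append((score, j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


def predict_nearest_centroid(X, y, features, x_new):
    """Assign x_new to the class whose mean profile is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [mean([r[j] for r in rows]) for j in features]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy; features are re-selected in each fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)
        if predict_nearest_centroid(X_train, y_train, features, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Invented data: feature 0 separates the groups, features 1-2 are noise.
toy_X = [
    [10.0, 5.0, 1.0], [11.0, 4.0, 2.0], [9.5, 6.0, 1.5], [10.5, 5.5, 0.5],
    [2.0, 5.2, 1.2], [1.5, 4.5, 1.8], [2.5, 5.8, 0.8], [1.0, 5.1, 1.4],
]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = rotenone-treated, 0 = not
toy_accuracy = loo_accuracy(toy_X, toy_y, k=1)
```

Selecting features once on the full dataset and then cross-validating
would let the held-out sample influence the selection, which tends to
give optimistic accuracy estimates when there are far more features
than samples.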

References
1. Barla, A., et al., Machine learning methods for predictive
proteomics. Brief Bioinform, 2008. 9(2): p. 119-28.
2. Blanksby, S.J. and T.W. Mitchell, Advances in mass spectrometry for
lipidomics. Annu Rev Anal Chem (Palo Alto Calif), 2010. 3: p. 433-65.
3. Katajamaa, M. and M. Oresic, Processing methods for differential
analysis of LC/MS profile data. BMC Bioinformatics, 2005. 6: p. 179.
4. LIPID MAPS. Available from: www.lipidmaps.org.
5. DanteR, http://omics.pnl.gov/software/DanteR.php.
6. Bergheanu, S.C., et al., Lipidomic approach to evaluate
rosuvastatin and atorvastatin at various dosages: investigating
differential effects among statins. Curr Med Res Opin, 2008. 24(9): p.
2477-87.
7. Wang, C., J. Yang, and J. Nie, Plasma phospholipid metabolic
profiling and biomarkers of rats following radiation exposure based on
liquid chromatography-mass spectrometry technique. Biomed Chromatogr,
2009. 23(10): p. 1079-85.
8. Hu, C., et al., Analytical strategies in lipidomics and
applications in disease biomarker discovery. J Chromatogr B Analyt
Technol Biomed Life Sci, 2009. 877(26): p. 2836-46.
9. Ganapathiraju, M., et al., Transmembrane helix prediction using
amino acid property features and latent semantic analysis. BMC
Bioinformatics, 2008. 9 Suppl 1: p. S4.
10. Qi, Y., et al., Systematic prediction of human membrane receptor
interactions. Proteomics, 2009. 9(23): p. 5243-55.
11. Cheng, B.Y., J.G. Carbonell, and J. Klein-Seetharaman, Protein
classification based on text document classification techniques.
Proteins, 2005. 58(4): p. 955-70.
