The next session of the survey course is starting:
2612 Introduction to Machine Learning and Data Mining
Sign up here: Machine Learning / Data Mining Survey Course
Dates: Jan 31, 2012 - Apr 3, 2012, 6:30 - 9:30 PM. The first nine sessions meet on Tuesdays; the tenth session meets on a Monday.
Location: 2505 Augustine Drive
Santa Clara, CA
The Spring schedule for the Association for Computing Machinery Data Mining SIG is detailed below.
This group meets on the fourth Monday of each month at LinkedIn:
2025 Stierlin Ct.
Mountain View, CA 94043
Feb 27: Ron Bekkerman, LinkedIn
March 26: Michael Mahoney, Stanford
April 23: Lionel Jouffe, Bayesia Labs
May 28: Giovanni Seni, Intuit
Feb:
Title:
Scaling Up Machine Learning: Parallel and Distributed Approaches
Abstract:
In this talk, I'll provide an extensive introduction to parallel and distributed machine learning. I'll answer questions such as "How big is big data, actually?", "How much training data is enough?", "What do we do if we don't have enough training data?", and "What are the platform choices for parallel learning?" Using the example of k-means clustering, I'll discuss the pros and cons of machine learning in Pig, MPI, DryadLINQ, and CUDA. Time permitting, I'll take a deep dive into parallel information-theoretic clustering.
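The talk's running example lends itself to a compact sketch. Below is a minimal serial k-means in Python (toy data and arbitrary parameters of my own, not from the talk), with comments marking the steps that shard naturally across workers in the MPI/MapReduce-style settings the talk compares:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means. The assignment step is embarrassingly parallel,
    which is what makes the algorithm a natural example for
    distributed platforms such as MPI or MapReduce."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point independently finds its nearest center.
        # In a distributed setting this loop is sharded across workers.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: workers emit partial sums and counts;
        # a reduce step recomputes each cluster mean.
        for j, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

# Two obvious blobs; k-means should recover their means.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(pts, k=2)
```

With this toy data the two returned centers converge to the two blob means regardless of which points are sampled as initial centers.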
Bio:
Ron Bekkerman is a senior research scientist at LinkedIn where he develops machine learning and data mining algorithms to enhance LinkedIn products. Prior to LinkedIn, he was a researcher at HP Labs. Ron completed his PhD in Computer Science at the University of Massachusetts Amherst in 2007. He holds BSc and MSc degrees from the Technion---Israel Institute of Technology. Ron has published on various aspects of clustering, including multimodal clustering, semi-supervised clustering, interactive clustering, consensus clustering, one-class clustering, and clustering parallelization.
March:
Title:
Randomized Algorithms for Matrices and Data
Presenter: Michael W. Mahoney, Stanford
Abstract:
Randomized algorithms for very large matrix problems (such as matrix multiplication, least-squares regression, the Singular Value Decomposition, etc.) have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis; this approach provides a novel paradigm and complementary perspective to traditional numerical linear algebra approaches to matrix computations, and the success of this line of work opens the possibility of performing matrix-based computations with truly massive data sets. Originating within theoretical computer science, this work was subsequently extended and applied in important ways by researchers from numerical linear algebra, statistics, applied mathematics, data analysis, and machine learning, as well as domain scientists.
In this talk, we will provide an overview of this approach, with an emphasis on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data analysis applications. Crucial in this context is the connection with the concept of statistical leverage. Historically, this notion, and in particular the diagonal elements of the so-called hat matrix, has been used in regression diagnostics to identify errors and outliers. Recently, however, the connection with statistical leverage has proved crucial in the development of improved matrix algorithms that come with worst-case guarantees, that are amenable to high-quality numerical implementation, and that are also useful to domain scientists. The talk will describe these developments, how to approximate the statistical leverage scores very precisely in time qualitatively faster than the usual naive method, and an example of how these ideas can be applied in large-scale distributed and parallel computational environments.
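For concreteness, the statistical leverage scores mentioned above can be computed exactly (for small matrices) as the diagonal of the hat matrix. A minimal Python sketch, using a made-up 4x2 matrix whose last row is an obvious outlier (the randomized algorithms the talk covers approximate these scores much faster than this direct computation):

```python
def leverage_scores(A):
    """Exact statistical leverage scores of a tall matrix A (n x d, n >= d):
    the diagonal of the hat matrix A (A^T A)^{-1} A^T, computed here as the
    squared row norms of an orthonormal basis Q for the column space of A."""
    n, d = len(A), len(A[0])
    # Modified Gram-Schmidt builds the orthonormal columns of Q one at a time.
    Q = []
    for j in range(d):
        v = [A[i][j] for i in range(n)]
        for q in Q:
            dot = sum(q[i] * v[i] for i in range(n))
            v = [v[i] - dot * q[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        Q.append([x / norm for x in v])
    # Leverage of row i = sum_j Q[j][i]^2; scores lie in [0, 1] and sum to d.
    return [sum(Q[j][i] ** 2 for j in range(d)) for i in range(n)]

# The last row dominates the column space, so it carries the highest leverage.
scores = leverage_scores([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 0.0]])
```

The scores always sum to the column dimension d, and a row with leverage near 1 (here the outlier row) is exactly the kind of point that regression diagnostics flag.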
Bio:
Michael Mahoney is at Stanford University. His research interests center around algorithms for very large-scale statistical data analysis, including both theoretical and applied aspects of problems in scientific and Internet domains. His current research interests include geometric network analysis; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, and information dynamics in large social and information networks. He has also worked on randomized matrix algorithms and their applications to genetics, medical imaging, and Internet problems. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD was in computational statistical mechanics at Yale University.
April:
Title:
Introduction to Bayesian Belief Networks and their Applications
Presenter: Dr. Lionel Jouffe, co-founder and CEO of France-based Bayesia S.A.S.
Bio:
Lionel Jouffe received the Ph.D. degree in Computer Science from the Université de Rennes I, Rennes, France, in 1997. After one year dedicated to industrializing the results of his Ph.D. research (Fuzzy Inference System learning by reinforcement methods – an automatic pig-house atmosphere controller), he received the Inov’Space Award and the medal of the town of Rennes. He joined ESIEA as a Professor/Researcher in 1998 and began his research on learning Bayesian networks from data. In 2001, Lionel co-founded Bayesia, a company specializing in Bayesian networks technology. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining, and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.
Abstract:
Bayesian Belief networks have emerged as a new form of probabilistic knowledge representation and probabilistic inference engine through the seminal works of UCLA Professor Judea Pearl. Over the last 25 years, the properties of Bayesian networks have been fully validated in the world of academia, and they are now becoming powerful and practical tools for “deep understanding” of very complex, high-dimensional problem domains. Their computational efficiency and inherently visual structure make Bayesian Belief networks very attractive for expert knowledge modeling, data mining, and causal analysis.
This tutorial will provide an introduction to the wide-ranging applications of Bayesian Belief networks. Participants do not need any prior familiarity with Bayesian Belief networks. We will start the seminar by illustrating the conceptual foundations using several textbook examples. This will include an overview of unsupervised learning (knowledge discovery), supervised learning (dependent-variable characterization), data clustering (segmentation), variable clustering (to find hidden concepts), and Probabilistic Structural Equation Models (mainly applied for driver analysis).
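As a minimal illustration of the kind of probabilistic inference a Bayesian network performs, here is a two-node toy network (cause C with an effect E) in plain Python. The structure and all probabilities are invented for illustration; they are not an example from the tutorial:

```python
# Toy network C -> E: a prior on the cause and a conditional
# probability table (CPT) for the effect given the cause.
p_c = 0.01                                 # prior P(C = true)
p_e_given_c = {True: 0.9, False: 0.05}     # CPT: P(E = true | C)

# Observing E = true, infer the cause via Bayes' rule:
# P(C | E) = P(E | C) P(C) / P(E).
joint_true = p_c * p_e_given_c[True]           # P(C = true,  E = true)
joint_false = (1 - p_c) * p_e_given_c[False]   # P(C = false, E = true)
posterior = joint_true / (joint_true + joint_false)
```

Even with a strong link from cause to effect, the rare prior keeps the posterior modest (about 0.15 here), the sort of non-obvious conclusion that graphical inference makes routine in larger networks.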
Bayesia will provide all participants with an unrestricted 30-day license of BayesiaLab 5.0 Professional Edition, so they can participate in exercises on their own laptops.
May:
Title: "Advances in Regularization: Bridge Regression and Coordinate Descent Algorithms."
Abstract:
"A widely held principle in statistical model inference is that accuracy and simplicity are both desirable. But there is a tradeoff between the two: a flexible (more complex) model is often needed to achieve higher accuracy, but it is more susceptible to overfitting and less likely to generalize well. Regularization techniques “damp down” the flexibility of a model-fitting procedure by augmenting the error function with a term that penalizes model complexity. Minimizing the augmented error criterion requires a certain increase in accuracy to "pay" for the increase in model complexity (e.g., adding another term to the model). This talk offers a concise introduction to this topic and a review of recent developments leading to very fast algorithms for parameter estimation with various types of penalties. It concludes with an example in R, showing an application of the techniques to a document classification task with one million predictors."
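The talk's concluding example is in R, but the flavor of the coordinate-descent algorithms it reviews can be sketched in Python with a toy lasso solver. The data and penalty value below are made up for illustration:

```python
def soft_threshold(z, g):
    """Soft-thresholding operator: the closed-form solution of each
    one-dimensional lasso subproblem, and the core of coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, iters=200):
    """Cyclic coordinate descent for
    min_b (1/2n) * ||y - X b||^2 + lam * ||b||_1,
    updating one coefficient at a time while the rest are held fixed."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # Partial residual: remove every feature's contribution except j's.
            r = [y[i] - sum(X[i][k] * b[k] for k in range(d) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # The penalty shrinks rho toward zero; small effects are zeroed out.
            b[j] = soft_threshold(rho, lam) / z
    return b

# Orthogonal toy design: feature 0 has a strong effect, feature 1 a weak one.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.1, -0.1]
b = lasso_cd(X, y, lam=0.2)
```

On this toy data the penalty shrinks the strong coefficient below its least-squares value and sets the weak one exactly to zero, which is the "pay for complexity" behavior the abstract describes.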
Bio:
Giovanni Seni is currently a Senior Data Scientist with Intuit. As an active data mining practitioner in Silicon Valley, he has over 15 years of R&D experience in statistical pattern recognition, data mining, and human-computer interaction applications. He has been a member of the technical staff at large technology companies and a contributor at smaller organizations. He holds five US patents and has published over twenty conference and journal articles. His book with John Elder, "Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions", was published in February 2010 by Morgan & Claypool. Giovanni is also an adjunct faculty member in the Computer Engineering Department of Santa Clara University, where he teaches an Introduction to Pattern Recognition and Data Mining class.
--
Patricia Hoffman PhD