PySpark SGD using scikit-learn (cross-posted from sklearn group)


MLnick

Feb 4, 2013, 9:59:31 AM
to spark...@googlegroups.com
Hi,

Just thought some might be interested in a thread from the sklearn mailing list about a very simple parallelized SGD example I worked up using the Python API and sklearn's SGD module. It implements an alternative version of logistic regression, and because it uses sklearn, all the model attributes can be set (e.g. learning rate, loss type for different objective functions, etc.):


=====

@Robert sorry for the delay in responding, I was away on vacation.

Here's a link to a gist of a very simple implementation of parallelized SGD
using Spark (https://gist.github.com/4707012). It basically replicates the
existing Spark logistic regression example, but using sklearn's
linear_model module. However the approach used is iterative parameter
mixtures (where the local weight vectors are averaged and the resulting
weight vector rebroadcast) as opposed to distributed gradient descent
(where the local gradients are aggregated, a gradient step taken on the
master and the weight vector rebroadcast) - see
http://faculty.utpa.edu/reillycf/courses/CSCI6175-F11/papers/nips2010mannetal.pdf
for some details.

This is partly because sklearn doesn't give access to the gradients in any
case (as far as I can tell), but does give access to params (.coef_ and
.intercept_), and partly because IPM appears superior in terms of
wall-clock speed in the paper with equivalent accuracy, at least for SGD.

(As an aside, interestingly Vowpal Wabbit's standard approach is one pass of
SGD with (weighted) averaging, followed by distributed gradients but using
L-BFGS on each node - this gives quick convergence to a good solution with
SGD, then gets the rest of the way to the "best" solution with L-BFGS.)

As you can see, this simple version of cluster-distributed SGD with sklearn
and Spark inherits a lot of sklearn's power (e.g. we can set learning
rates, loss types for classification / regression, etc), with the only
additional code needed being a training function and one for merging models.

Nick


On Sun, Jan 27, 2013 at 8:01 PM, Robert Kern <robert.kern@...> wrote:

> On Thu, Jan 24, 2013 at 10:06 AM, Nick Pentreath
> <nick.pentreath@...> wrote:
> > May I suggest you look at Spark (http://spark-project.org/ and
> > https://github.com/mesos/spark).
> >
> > It is written in Scala, has a Java API and the current master branch has
> the
> > new Python API (0.7.0 release when it happens). I've been doing some
> > testing, including using sklearn together with Spark, and so far it looks
> > good. The bonus is no Hadoop MapReduce (but fully HDFS compatible if you
> > need the filesystem), and you can write all your code directly in Python.
>
> I've been keeping an interested eye on the Spark project for a while
> now. Can you share any sklearn+Spark examples that you've worked up so
> far?
>
> --
> Robert Kern

Matei Zaharia

Feb 5, 2013, 3:37:10 PM
to spark...@googlegroups.com
Cool, thanks for posting this! It's nice to see that Python libraries can be used here. I don't know much about the Python ML libraries myself, but it would be great to try to interoperate with them in PySpark.

Matei


MLnick

Feb 6, 2013, 8:13:47 AM
to spark...@googlegroups.com
I think scikit-learn is the natural starting point for building ML on PySpark, since it already has heavily optimised code for SGD, k-means, etc. (usually written in Cython or as wrappers for C libraries like liblinear). They also have some useful feature extraction code (text, tf-idf encoding, and now in 0.13 feature hashing!). The API is fairly consistent too - each algorithm has "fit" and "predict" methods, etc.

One downside is that the majority of the library is focused on in-core, single-machine algorithms. For online variants (which are most easily parallelised), there are SGD for linear models and mini-batch k-means, both of which support the "partial_fit" method.

As an aside, does anyone know of a large binary (or multinomial) classification dataset that could serve to benchmark? Many of the papers use multi-billion row clickstream datasets but usually they are internal to Google, Yahoo etc.

N

Stoney Vintson

Feb 10, 2013, 3:31:50 PM
to spark...@googlegroups.com
You might look through the Amazon AWS Public Data Sets, or ask this question on the Kaggle.com forum after looking around a bit.

AWS Public Data Sets list

AWS Public Data Sets forum

InfoChimps Datasets (they have their own platform - maybe they would offer Spark on it)

MLnick

Feb 11, 2013, 2:45:15 AM
to spark...@googlegroups.com
I did previously take a look through Kaggle and AWS but didn't turn up anything - this time I looked again through the older competitions and found KDD Cup 2012: https://www.kddcup2012.org/c/kddcup2012-track2, which looks perfect!

Thanks a lot!
Nick