Sprinting at PyCon US 2012 in Santa Clara in March.

Olivier Grisel

Dec 6, 2011, 6:02:14 AM

to scikit-learn-general, pystat...@googlegroups.com

Hi all,

My tutorial on scikit-learn at PyCon has been accepted. Would anybody
be interested in sprinting there? The sprint days are Mar. 12-15.

http://us.pycon.org/2012/

I think Wes has submitted a talk on Pandas too.

I would be very interested in sprinting on machine learning & data
analytics in the cloud using partitioned memory mapped arrays to
prototype a low overhead alternative to the Hadoop MapReduce runtime
optimized for numerical data and in-memory iterative processing,
probably leveraging IPython.parallel and POSIX sendfile [1].

Some Pandas idioms like groupBy and alignment would be interesting to
investigate in a distributed setting IMHO.

[1] http://linux.die.net/man/2/sendfile

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Fernando Perez

Dec 6, 2011, 6:11:09 AM

to pystat...@googlegroups.com, scikit-learn-general

On Tue, Dec 6, 2011 at 3:02 AM, Olivier Grisel <olivier...@ensta.org> wrote:
> My tutorial on scikit-learn at PyCon has been accepted. Would anybody
> be interested in sprinting there? The sprint days are Mar. 12-15.
>
> http://us.pycon.org/2012/
>
> I think Wes has submitted a talk on Pandas too.

Min and I will be there too, presenting a tutorial on ipython
(including the parallel stuff). So we should definitely be able to
have some fun on this :)

Cheers,

f

Olivier Grisel

Dec 6, 2011, 9:54:12 AM

to pystat...@googlegroups.com, scikit-learn-general

2011/12/6 Fernando Perez <fpere...@gmail.com>:

Great!

Wes McKinney

Dec 6, 2011, 8:51:10 PM

to pystat...@googlegroups.com, scikit-learn-general

I'll be there too. I have a tutorial on pandas and I'll additionally
be in town the week before for Strata. Guess it's going to be a couple
of pleasant weeks on the West Coast.

I think we could put together some great stuff on the large-scale data
processing front-- it's been high on my list of things to do lately as
I've been working on pandas. Generally with data processing tools
there's always a long list of things that can be done.

Looking forward to it already!

best,
Wes

Olivier Grisel

Dec 7, 2011, 3:22:00 AM

to scikit-lea...@lists.sourceforge.net, pystat...@googlegroups.com

Excellent,

That makes 5 of us interested in the same topic; I think this is
going to be a great sprint.

--
Olivier

Olivier Grisel

Dec 7, 2011, 9:28:15 AM

to scikit-lea...@lists.sourceforge.net, pystat...@googlegroups.com

2011/12/7 Timmy Wilson <tim...@smarttypes.org>:
> I would love to sit in, and learn, and contribute where i can.
>
> Probably won't have time for this during the sprint -- but i want to
> throw it out there:
>
> The importance of locality in many manifold learning algos makes them
> good candidates for distribution.

This is interesting but AFAIK there is no established way to achieve
this and this is still an open research problem.

Personally I don't plan to work on the parallelization of the machine
learning algorithms themselves during this sprint, but to focus more on
the infrastructure. That said, to make informed decisions on the infra
it's good to have some motivating and representative use cases in mind
that can be used to validate proof-of-concept implementations.

For instance: in the machine learning domain (scikit-learn) we could have:
- sparse coding with a fixed dictionary (embarrassingly parallel)
- distributed fitting of a linear model with SGD & averaging (can be
implemented efficiently with message passing I think).
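
To make the "SGD & averaging" use case concrete, here is a minimal
single-process sketch, assuming a least-squares linear model and NumPy.
The function name `sgd_linear`, the learning rate, the synthetic data
and the number of partitions are all illustrative assumptions, not
something specified in this thread. Each partition is fitted
independently and only the small weight vectors are averaged, which is
the part that would travel over message passing.

```python
import numpy as np


def sgd_linear(X, y, n_epochs=20, lr=0.01, seed=0):
    """Plain SGD for least-squares linear regression on one partition."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Visit the local samples in a fresh random order each epoch.
        for i in rng.permutation(X.shape[0]):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w


# Synthetic data (assumption): a known weight vector plus small noise.
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(600, 3))
y = X @ true_w + 0.01 * rng.normal(size=600)

# Split the row indices into 4 partitions, one per hypothetical worker.
partitions = np.array_split(np.arange(600), 4)

# Each worker fits its own partition; here this is a plain local loop.
local_weights = [
    sgd_linear(X[idx], y[idx], seed=k) for k, idx in enumerate(partitions)
]

# Only the per-worker weight vectors need to be exchanged and averaged.
averaged_w = np.mean(local_weights, axis=0)
```

In a distributed setting the list comprehension would be replaced by a
parallel map over the workers; the averaging step stays the same.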

In the general data-analytics domain (Pandas & statsmodels):
- distributed (& streaming) computation of means, variances and other moments.
- distributed implementation of the alignment of 2 datasets (2d tables)
using a common key: e.g. the timecode in a time series.
- distributed implementation of the GroupBy feature of Pandas
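
As an illustration of the first item, means and variances can be
computed per partition in a single streaming pass and then merged
pairwise. The sketch below is one possible way to do it in pure Python,
using Welford's update and the pairwise merge formula of Chan et al.;
the `Moments` class and the `summarize` helper are hypothetical names
introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Count, mean and sum of squared deviations (m2) of a data stream."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's streaming update for one new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> "Moments":
        # Pairwise combination of two partial summaries (Chan et al.).
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2)
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def summarize(chunk):
    """Summarize one partition in a single pass."""
    m = Moments()
    for x in chunk:
        m.update(x)
    return m


# Three partitions standing in for data held on three nodes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

total = Moments()
for part in map(summarize, chunks):
    total = total.merge(part)
```

Because `merge` is associative, the partial summaries can be combined
in any order or as a tree, and each one is only three numbers.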

Also speaking about scaling machine learning algorithms, the following
blog post titled "Hadoop AllReduce and Terascale Learning" by John
Langford is very interesting:

http://hunch.net/?p=2094

Maybe we should open a wikipage for sprint planning. Fernando shall we
use the IPython wiki on github (if so please enable it)? Otherwise we
can use the scikit-learn wiki that we regularly use for sprint
planning, e.g.:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

Timmy Wilson

Dec 7, 2011, 9:59:25 AM

to scikit-lea...@lists.sourceforge.net, helle...@cs.berkeley.edu, pystat...@googlegroups.com

> Also speaking about scaling machine learning algorithms, the following
> blog post titled "Hadoop AllReduce and Terascale Learning" by John
> Langford is very interesting:
>
> http://hunch.net/?p=2094

This one is from Joe Hellerstein of bloom fame -- http://www.bloom-lang.net/

http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/

"
From an architectural point of view, a good language for parallelism
should expose pipelining, and MapReduce hides it. Brian suggested I
expand on this point somewhere so people could talk about it. So here
we go.
"

Joe's in Berkeley -- maybe he'll join us ;]

Fernando Perez

Dec 7, 2011, 8:31:23 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

Hey,

On Wed, Dec 7, 2011 at 6:28 AM, Olivier Grisel <olivier...@ensta.org> wrote:
> Maybe we should open a wikipage for sprint planning. Fernando shall we
> use the IPython wiki on github (if so please enable it)?

we decided a while ago not to use the github wiki but instead to have
a separate one (a rare instance where we didn't go with the github
flow); I've made a planning page here:

http://wiki.ipython.org/PyCon12Sprint

feel free to add/edit at will...

Gotta run, but I agree on the value of having some thought put into
scoping out the problem for a few concrete use cases, so we can make
realistic progress and not just flail around...

Cheers,

f

Olivier Grisel

Dec 8, 2011, 5:35:26 AM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

Ok I have edited the wiki page to add scope, motivation and use case
information for this sprint.

Please feel free to edit and comment:

http://wiki.ipython.org/PyCon12Sprint

In particular I am interested in feedback on which use cases are the
most important to you.

--
Olivier

Olivier Grisel

Dec 8, 2011, 6:16:53 AM

to scikit-lea...@lists.sourceforge.net, pystat...@googlegroups.com

2011/12/8 Satrajit Ghosh <sa...@mit.edu>:
> hi olivier,
>
> thanks for the info. that is very helpful. i would be very curious how you
> plan to make efficient data distribution across nodes of a cluster and
> balance i/o performance. (or is part of the low overhead the notion that you
> are looking at a shared memory architecture primarily).

I don't plan to work on shared memory: partition the input arrays and
send memmappable chunks on each node (maybe with some replication if
that can help better use the available CPUs - task dependent).
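
A rough local sketch of that partitioning step, assuming NumPy's
`np.memmap` and raw binary files: the helper names
(`partition_to_memmaps`, `chunk_sum`), the file naming scheme and the
toy array are assumptions made for this example. Shipping the files to
the nodes is not shown; here everything stays in one temporary folder.

```python
import os
import tempfile

import numpy as np


def partition_to_memmaps(data, n_chunks, folder):
    """Write row-wise partitions of `data` as raw binary files.

    Returns one (path, shape, dtype) spec per chunk: everything a
    worker needs to map its chunk without loading it eagerly.
    """
    specs = []
    for k, chunk in enumerate(np.array_split(data, n_chunks)):
        path = os.path.join(folder, f"chunk_{k}.bin")
        mm = np.memmap(path, dtype=chunk.dtype, mode="w+", shape=chunk.shape)
        mm[:] = chunk
        mm.flush()
        del mm  # release the mapping once the data is on disk
        specs.append((path, chunk.shape, chunk.dtype))
    return specs


def chunk_sum(spec):
    """Task run by a worker: map its chunk read-only and reduce it."""
    path, shape, dtype = spec
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    return float(mm.sum())


data = np.arange(1000, dtype=np.float64).reshape(100, 10)

with tempfile.TemporaryDirectory() as folder:
    specs = partition_to_memmaps(data, 4, folder)
    # Small per-chunk results like these are what would be sent back
    # through message passing.
    partial_sums = [chunk_sum(s) for s in specs]

total = sum(partial_sums)
```

Only the small specs and partial results move between processes; the
array data itself is paged in from the files by the operating system.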

For small intermediate and output data, IPython.parallel based message passing.

For large distributed aggregate / shuffle / reduce: I don't have any
fixed ideas but I think a hard drive barrier will be necessary (as
implemented in Hadoop).

Fernando Perez

Dec 8, 2011, 9:44:36 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

This is *excellent*, many thanks!!

Logistical question: do we know anything yet about the
venue/space/etc? Does pycon itself provide sufficient sprint space or
do we need to make arrangements ourselves? Being semi-local I can
look into it if needed, but if it's already been taken care of by
pycon I'll be happy not to worry about logistics.

Cheers,

f

Olivier Grisel

Dec 9, 2011, 3:17:39 AM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

2011/12/9 Fernando Perez <fpere...@gmail.com>:

Last year in Atlanta there were many rooms available with power
outlets and good wifi to accommodate all the sprinters from the various
projects that decided to set up a sprint. I don't think we need to worry
about this.

Olivier Grisel

Dec 9, 2011, 4:37:43 AM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

2011/12/9 Olivier Grisel <olivier...@ensta.org>:

I was thinking: it would be great if we could get ssh access to a
small Linux cluster (e.g. 10 nodes) with IPython / numpy / scipy
installed on it at Berkeley for the duration of the sprint so as to be
able to quickly test implementation ideas.

Otherwise we can use EC2 or Rackspace but that will be more expensive.

Fernando Perez

Dec 9, 2011, 12:53:32 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

On Fri, Dec 9, 2011 at 1:37 AM, Olivier Grisel <olivier...@ensta.org> wrote:
> I was thinking: it would be great if we could get ssh access to a
> small Linux cluster (e.g. 10 nodes) with IPython / numpy / scipy
> installed on it at Berkeley for the duration of the sprint so as to be
> able to quickly test implementation ideas.

Good idea; I'll take care of it and will make sure we have enough
guest accounts for that.

Cheers,

f

Olivier Grisel

Dec 9, 2011, 1:31:38 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

2011/12/9 Fernando Perez <fpere...@gmail.com>:

Thanks.

Wes McKinney

Jan 25, 2012, 1:36:13 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

hi Olivier,

do we want to still do a data / statsmodels / scikit-learn sprint at
PyCon? I will be there the first two sprint days, leaving town (after
a very extended stay due to Strata 2 weeks beforehand) on 3/15.

- Wes

Olivier Grisel

Jan 25, 2012, 1:49:23 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

2012/1/25 Wes McKinney <wesm...@gmail.com>:

>
> hi Olivier,
>
> do we want to still do a data / statsmodels / scikit-learn sprint at
> PyCon? I will be there the first two sprint days, leaving town (after
> a very extended stay due to Strata 2 weeks beforehand) on 3/15.

I still think the PyCon venue would be nice, so that we can interact
with the rest of the Python community (while not sprinting on our own
stuff). Now if everybody else prefers to do it in Berkeley rather than
Santa Clara I would be fine with that too.

Wes McKinney

Jan 25, 2012, 10:07:26 PM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

I'm happy to do it at PyCon since I assume there will be plenty of
space plus perhaps snacks and definitely camaraderie. Just wanted to
check that "the sprint is on!". Do you want to get something
officially on the schedule or shall I?

- Wes

Fernando Perez

Jan 26, 2012, 1:36:34 AM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

On Wed, Jan 25, 2012 at 7:07 PM, Wes McKinney <wesm...@gmail.com> wrote:
> I'm happy to do it at PyCon since I assume there will be plenty of
> space plus perhaps snacks and definitely camaraderie. Just wanted to
> check that "the sprint is on!". Do you want to get something
> officially on the schedule or shall I?

We can also do it at Berkeley, but I should note that it's *not*
trivial to secure space for several days in a row for a bunch of
people on campus.

So while for me Berkeley is much more convenient than Santa Clara
transportation-wise, I think I'd rather take advantage of the logistical
support of PyCon than try to secure space on campus for a
potentially large group...

Wes, you're welcome to add this to the ipython/sklearn one and turn it
into a 'pydata: ipython+sklearn+statsmodels' so that people can flow
between the three tools as desired:

https://us.pycon.org/2012/community/sprints/projects/

You can also use our planning page if you want and adjust it accordingly:

http://wiki.ipython.org/PyCon12Sprint

Since we'll have enough 'core' people from each of the three projects,
if there are participants who want to focus on only one of them, we
can help them out, while the larger objectives remain:

- ipython/sklearn integration for parallel analyses
- sklearn/statsmodels as per this thread.

How does that sound?

Cheers,

f

Olivier Grisel

Jan 26, 2012, 3:26:52 AM

to pystat...@googlegroups.com, scikit-lea...@lists.sourceforge.net

2012/1/26 Fernando Perez <fpere...@gmail.com>:

As a pandas user I would really like to take the opportunity of this
sprint to work on (or at least discuss the design of) multi-core then
distributed sort / groupby / merge as I mentioned in the original
proposal (which is still on the ipython wiki page).
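
For discussion purposes, one possible shape for a partitioned groupby
is a per-partition partial aggregation followed by a combine step. The
sketch below is pure Python rather than pandas, and the function names
and the toy `(key, value)` rows are assumptions made for illustration
only; it computes a per-group mean from mergeable (sum, count) pairs.

```python
from collections import defaultdict


def partial_groupby_sum(rows):
    """Per-partition step: accumulate [sum, count] for each group key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def combine(partials):
    """Combine step: merge the partial [sum, count] pairs, derive means."""
    merged = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, c) in part.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}


# Three partitions standing in for row blocks held by different cores
# or nodes.
partitions = [
    [("a", 1.0), ("b", 2.0), ("a", 3.0)],
    [("b", 4.0), ("c", 5.0)],
    [("a", 5.0), ("c", 7.0)],
]

group_means = combine(partial_groupby_sum(p) for p in partitions)
```

The per-partition step is independent for each block, so it maps
directly onto multiple cores first and multiple nodes later; only
aggregations that decompose this way (sum, count, mean, min, max) fit
the pattern without a shuffle.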
