playing with joblib


josef...@gmail.com

Oct 10, 2011, 5:12:52 PM
to pystat...@googlegroups.com
Is it worth it?

joblib looks very easy to use, and it is, at least in the examples.
I wanted to see how much we could gain with joblib on Windows
(notebook with 4 cores, 8 hyperthreads?).

For heavy calculations the time is much shorter.

For light-weight calculations with large loops in the Parallel call,
it can take longer than with n_jobs=1, or the gains are small (around 10%).
In some examples, even with n_jobs=-1, the CPU usage hovers around only 13%.

My impression is that for light-weight Monte Carlo, permutation, and bootstrap
tests it might not be worth it, or it requires defining jobs in batches.
For heavy-duty work like the Kalman filter the gains could be large.

Adding an import "from scipy import stats" reduces the performance of
light-weight functions under multiprocessing, making them much slower
than without the scipy import, and in the last 3 cases n_jobs>1 is
slower than a single process. The effect of the scipy import on heavier
functions is small (in relative terms), as expected.
(Time to break up or replace scipy.stats?)
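
The timing loop is roughly along these lines (a stripped-down sketch, not the
attached try_joblib_0.py; the work function and the sizes are made up):

import time
import numpy as np
from joblib import Parallel, delayed
# from scipy import stats   # the import whose effect is discussed above

def work(n):
    # stand-in for a calculation whose cost grows with n
    x = np.arange(n, dtype=float)
    return np.sqrt(x * x + 1.0).sum()

if __name__ == '__main__':   # guard needed for multiprocessing on Windows
    total = 400000   # keep the total work roughly constant, split into n_iter jobs
    for n_iter in [10000, 1000, 100]:
        print(n_iter)
        for n_jobs in [1, 2, 3, -1]:
            t0 = time.time()
            Parallel(n_jobs=n_jobs)(delayed(work)(total // n_iter)
                                    for _ in range(n_iter))
            print("%s %s" % (n_jobs, time.time() - t0))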

(Each block below: number of iterations in the Parallel loop, then n_jobs and
elapsed time in seconds.)

10000
1 1.17000007629
2 1.57500004768
3 1.62299990654
-1 2.15300011635

1000
1 1.04499983788
2 1.21700000763
3 1.09200000763
-1 1.40400004387

100
1 1.02900004387
2 1.1859998703
3 1.04500007629
-1 1.37299990654

I'm still only getting started with this, and the test examples are
pretty simple.

The last point looks like an expected difference between Windows and Linux.
Are the other observations roughly the same under Linux?

Josef

try_joblib_0.py

Alexandre Gramfort

Oct 10, 2011, 8:47:14 PM
to pystat...@googlegroups.com
I use joblib almost every day, and I confirm that it's worth it when the jobs
are not too short. If they are too short, you should work with batches.

example:

rather than:

>>> out = Parallel(n_jobs=4)(delayed(func)(x) for x in X)

to loop over each row of X, you could use:

>>> out = Parallel(n_jobs=4)(delayed(func)(x_batch) for x_batch in np.array_split(X, 4))

to loop over chunks of rows.
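
For concreteness, a self-contained version (func is just a placeholder for the
real per-row computation):

import numpy as np
from joblib import Parallel, delayed

def func(x_batch):
    # placeholder: apply the per-row computation to a whole chunk of rows
    return x_batch.sum(axis=1)

if __name__ == '__main__':   # guard needed for multiprocessing on Windows
    X = np.random.randn(1000, 5)
    # one delayed call per chunk instead of one per row keeps the
    # dispatch overhead small
    out = Parallel(n_jobs=4)(delayed(func)(x_batch)
                             for x_batch in np.array_split(X, 4))
    out = np.concatenate(out)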

Alex

josef...@gmail.com

Oct 10, 2011, 10:31:20 PM
to pystat...@googlegroups.com
On Mon, Oct 10, 2011 at 8:47 PM, Alexandre Gramfort
<alexandre...@gmail.com> wrote:
> I use joblib almost every day, and I confirm that it's worth it when the jobs
> are not too short. If they are too short, you should work with batches.
>
> example:
>
> rather than:
>
>>>> out = Parallel(n_jobs=4)(delayed(func)(x) for x in X)
>
> to loop over each row of X, you could use:
>
>>>> out = Parallel(n_jobs=4)(delayed(func)(x_batch) for x_batch in np.array_split(X, 4))
>
> to loop over chunks of rows.

Thanks, that looks good

max_abs = np.concatenate(parallel(my_max_stat(X, X2, p, dof_scaling)
                                  for p in np.array_split(perms, n_jobs)))

This reduces the overhead the most; I didn't understand the
meaning of this before (I didn't know about np.array_split).
It needs an intermediate function to do the batching, though.
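
Roughly like this, just as an illustration (batch_max_stat and the toy
statistic are made up, not the actual permutation-test code):

import numpy as np
from joblib import Parallel, delayed

def max_stat(X, signs):
    # made-up statistic: max absolute column mean after sign-flipping the rows
    return np.abs((X * signs[:, None]).mean(axis=0)).max()

def batch_max_stat(X, sign_batch):
    # the intermediate function: run a whole chunk of permutations in one process
    return np.array([max_stat(X, s) for s in sign_batch])

if __name__ == '__main__':
    X = np.random.randn(50, 10)
    signs = np.sign(np.random.randn(200, 50))   # 200 sign-flip permutations
    n_jobs = 4
    max_abs = np.concatenate(
        Parallel(n_jobs=n_jobs)(delayed(batch_max_stat)(X, sb)
                                for sb in np.array_split(signs, n_jobs)))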

Your use of joblib in the permutation test, and the use in
sklearn.cross_validation were the main reasons for me to look at it.

Still, because of the higher cost of setting up a process on Windows, the
threshold where multiprocessing is an advantage might be higher than on
Linux. (An import of scipy.stats still kills the advantage of
multiprocessing even with n_jobs=4; for the last example, the processing
time is around 1 second.)

n_jobs=-1 doesn't seem to get it quite right:

4 iterations in the Parallel loop
n_jobs   time (s)
1 9.92100000381
2 6.11599993706
3 5.99000000954
4 3.75999999046
6 3.97800016403
8 4.25899982452
-1 4.16499996185

Thanks,

Josef

Alexandre Gramfort

Oct 11, 2011, 8:43:44 AM
to pystat...@googlegroups.com
> Your use of joblib in the permutation test, and the use in
> sklearn.cross_validation were the main reasons for me to look at it.
>
> Still, because of the higher cost of setting up a process on Windows, the
> threshold where multiprocessing is an advantage might be higher than on
> Linux. (An import of scipy.stats still kills the advantage of
> multiprocessing even with n_jobs=4; for the last example, the processing
> time is around 1 second.)

I usually use joblib for computations of more than 10 seconds, and commonly
of a few minutes, even hours.
Joblib is not worth it if a job takes less than a second.

Alex

Skipper Seabold

Oct 11, 2011, 9:24:43 AM
to pystat...@googlegroups.com
On Mon, Oct 10, 2011 at 5:12 PM, <josef...@gmail.com> wrote:
> My impression is that for light-weight Monte Carlo, permutation, and bootstrap
> tests it might not be worth it, or it requires defining jobs in batches.
> For heavy-duty work like the Kalman filter the gains could be large.
>

I haven't thought about it, but the KF is not embarrassingly parallel
(indeed, it's recursive), so it's not at first obvious to me how to
parallelize it. We could evaluate the filter and the gradient for
example at the same time, though I don't know if each evaluation is
costly enough (I suspect not).
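
Just to illustrate the idea, something like this (loglike and score are toy
stand-ins, not the actual Kalman filter code):

import numpy as np
from joblib import Parallel, delayed

def loglike(params):
    # toy stand-in for an expensive log-likelihood evaluation
    return -0.5 * np.sum(params ** 2)

def score(params):
    # toy stand-in for a numerical gradient of the log-likelihood
    return -params

if __name__ == '__main__':
    params = np.array([0.5, -0.2, 1.3])
    # the two independent evaluations can run concurrently
    ll, grad = Parallel(n_jobs=2)(delayed(f)(params) for f in (loglike, score))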

I get these timings running your script out of the box (and with other
things eating up CPU cycles at the moment).


10000
1 1.70737504959
2 1.4141061306
3 1.41899490356
-1 1.42400503159

1000
1 1.48994278908
2 0.909826040268
3 0.611446142197
-1 0.5164270401

100
1 1.48153614998
2 0.808979034424
3 0.610709905624
-1 0.418374061584

Skipper

josef...@gmail.com

Oct 11, 2011, 9:48:23 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 9:24 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Mon, Oct 10, 2011 at 5:12 PM, <josef...@gmail.com> wrote:
>> My impression is that for light-weight Monte Carlo, permutation, and bootstrap
>> tests it might not be worth it, or it requires defining jobs in batches.
>> For heavy-duty work like the Kalman filter the gains could be large.
>>
>
> I haven't thought about it, but the KF is not embarrassingly parallel
> (indeed, it's recursive), so it's not at first obvious to me how to
> parallelize it. We could evaluate the filter and the gradient for
> example at the same time, though I don't know if each evaluation is
> costly enough (I suspect not).

I don't think there is much to gain for a single KF estimation; I was
thinking more in terms of Monte Carlo or bootstrap where the KF is in the
loop.

Given also Alex's reply: most of our models and statistical tests are
pretty fast (unless the dataset is large?), so to take advantage of
joblib or multiprocessing we have to construct "bigger" jobs to run,
or create large enough batches.
Just putting joblib over a few estimation or test runs will in many
cases not be time-consuming enough.
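
For example, a batched bootstrap would look roughly like this (a sketch;
ols_slope is a toy stand-in for whatever fit sits in the loop):

import numpy as np
from joblib import Parallel, delayed

def ols_slope(y, x):
    # toy stand-in for the model fit inside the loop
    return np.polyfit(x, y, 1)[0]

def bootstrap_batch(y, x, n_rep, seed):
    # one "big" job: n_rep bootstrap replications inside a single process
    rng = np.random.RandomState(seed)
    out = np.empty(n_rep)
    for i in range(n_rep):
        idx = rng.randint(0, len(y), len(y))
        out[i] = ols_slope(y[idx], x[idx])
    return out

if __name__ == '__main__':
    x = np.random.randn(200)
    y = 1.5 * x + np.random.randn(200)
    n_jobs, n_rep = 4, 1000
    boot = np.concatenate(Parallel(n_jobs=n_jobs)(
        delayed(bootstrap_batch)(y, x, n_rep // n_jobs, seed)
        for seed in range(n_jobs)))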

That's my main conclusion from this.

Skipper, can you add `from scipy import stats` at the top of the
script and run it again to see what the difference is?
You are on Linux, I assume.

Thanks,

Josef

Skipper Seabold

Oct 11, 2011, 10:20:16 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 9:48 AM, <josef...@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 9:24 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> > On Mon, Oct 10, 2011 at 5:12 PM, <josef...@gmail.com> wrote:
> >> My impression is that for light-weight Monte Carlo, permutation, and bootstrap
> >> tests it might not be worth it, or it requires defining jobs in batches.
> >> For heavy-duty work like the Kalman filter the gains could be large.
> >>
> >
> > I haven't thought about it, but the KF is not embarrassingly parallel
> > (indeed, it's recursive), so it's not at first obvious to me how to
> > parallelize it. We could evaluate the filter and the gradient for
> > example at the same time, though I don't know if each evaluation is
> > costly enough (I suspect not).
>
> I don't think there is much to gain for a single KF estimation; I was
> thinking more in terms of Monte Carlo or bootstrap where the KF is in the
> loop.

Oh, right.

> Given also Alex's reply: most of our models and statistical tests are
> pretty fast (unless the dataset is large?), so to take advantage of
> joblib or multiprocessing we have to construct "bigger" jobs to run,
> or create large enough batches.
> Just putting joblib over a few estimation or test runs will in many
> cases not be time-consuming enough.
>
> That's my main conclusion from this.

Mine as well.

I am. With the import at the top of the script:

10000
1 1.71545100212
2 1.42330384254
3 1.41989302635
-1 1.51475715637

1000
1 1.50667691231
2 0.806510925293
3 0.607214927673
-1 0.509503126144

100
1 1.47279000282
2 0.806162118912
3 0.607878923416
-1 0.509881973267

It's not clear to me why this would degrade performance. Is this
simply because, on the same machine, imports on Windows seem to take 5
times as long?

Skipper

josef...@gmail.com

Oct 11, 2011, 10:28:23 AM
to pystat...@googlegroups.com

Robert Kern explained it, and some other details, on the scikits.learn
mailing list.

Essentially, Linux forks, while Windows creates a new process and has to
newly load Python and import all packages in each process. That's how
I understand it.

see https://github.com/statsmodels/statsmodels/issues/72
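
A rough way to see the price each spawned process pays is to time a fresh
interpreter doing the import (just a quick check, numbers will vary):

import subprocess, sys, time

for code in ("pass", "import numpy", "from scipy import stats"):
    t0 = time.time()
    subprocess.call([sys.executable, "-c", code])
    print("%-25s %.2f s" % (code, time.time() - t0))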

Another reason for me to be allergic to unused, un-lazy imports, and a
big advantage of our `.api` restructuring.

Josef

josef...@gmail.com

Oct 11, 2011, 12:35:08 PM
to pystat...@googlegroups.com

And sklearn is in the "import life, universe, everything" camp.

>>> import copy, sys
>>> before = copy.copy(sys.modules)
>>> len(before)
132
>>> from sklearn.externals.joblib import Parallel, delayed
>>> after = copy.copy(sys.modules)
>>> len(after)
552
This counts numpy, which we cannot avoid, but also a large part of
scipy and, I guess, most of sklearn.

If I replace `from sklearn.externals.joblib import Parallel, delayed`
with `from joblib import Parallel, delayed`,
then I get much better numbers on Windows as well:

10000
1 1.20499992371
2 1.34500002861
3 1.27999997139
4 1.38000011444
6 1.45499992371
8 1.61500000954
-1 1.625

1000
1 1.0649998188
2 0.825000047684
3 0.65499997139
4 0.579999923706
6 0.590000152588
8 0.644999980927
-1 0.644999980927

100
1 1.05999994278
2 0.759999990463
3 0.630000114441
4 0.564999818802
6 0.570000171661
8 0.629999876022
-1 0.625

4
1 1.05999994278
2 0.745000123978
3 0.77999997139
4 0.56500005722
6 0.649999856949
8 0.735000133514
-1 0.730000019073

It is also twice as fast in a 4-second loop:
4
1 3.9470000267
2 2.71399998665
3 2.7610001564
4 1.93499994278
6 2.05900001526
8 2.3869998455
-1 2.32400012016

(My notebook is, as usual, nearly full: 6.8 GB of the 8 GB RAM are used up,
but there is not much CPU load, so the numbers could change.)

Josef
