joblib looks very easy to use, and it is, in the examples.
I wanted to see how much we can gain with joblib on Windows
(a notebook with 4 cores, 8 hyperthreads?).
For heavy calculations the time is much shorter.
For light-weight calculations with large loops in the Parallel call,
it can take longer than with n_jobs=1, or the gains are small (around 10%).
In some examples, even with n_jobs=-1, the CPU usage hovers at only
around 13% (roughly one logical core out of eight).
My impression is that for light-weight Monte Carlo, permutation, and
bootstrap tests, it might not be worth it, or it requires defining jobs
in batches. For heavy-duty work like a Kalman filter the gains could be
large.
Adding an import `from scipy import stats` reduces the performance of
light-weight functions under multiprocessing, making them much slower
than without the scipy import, and in the last 3 cases n_jobs>1 is
slower than a single process. The effect of the scipy import on heavier
functions is small in relative terms, as expected.
(Time to break up or replace scipy.stats?)
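For reference, a minimal sketch of the kind of timing script behind
the numbers below (light_stat and the array sizes are stand-ins, not
the actual script):

import time
import numpy as np
from joblib import Parallel, delayed

def light_stat(x):
    # stand-in for a cheap statistic; the real test function does more
    return x.mean()

if __name__ == "__main__":  # guard needed for multiprocessing on Windows
    for n_iter in [10000, 1000, 100]:
        # same total work, split into n_iter tasks inside the Parallel call
        X = np.random.randn(n_iter, 100000 // n_iter)
        print(n_iter)
        for n_jobs in [1, 2, 3, -1]:
            t0 = time.time()
            Parallel(n_jobs=n_jobs)(delayed(light_stat)(x) for x in X)
            print(n_jobs, time.time() - t0)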
iterations in Parallel loop: 10000
n_jobs   time (s)
     1   1.17000007629
     2   1.57500004768
     3   1.62299990654
    -1   2.15300011635

iterations in Parallel loop: 1000
n_jobs   time (s)
     1   1.04499983788
     2   1.21700000763
     3   1.09200000763
    -1   1.40400004387

iterations in Parallel loop: 100
n_jobs   time (s)
     1   1.02900004387
     2   1.1859998703
     3   1.04500007629
    -1   1.37299990654
I'm still only getting started with this, and the test examples are
pretty simple.
The last observation looks like an expected difference between Windows
and Linux. Are the other observations roughly the same under Linux?
Josef
example:
rather than:
>>> out = Parallel(n_jobs=4)(delayed(func)(x) for x in X)
to loop over each row of X, you could use:
>>> out = Parallel(n_jobs=4)(delayed(func)(x_batch) for x_batch in np.array_split(X, 4))
to loop over chunks of rows.
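For reference, np.array_split splits along the first axis and, unlike
np.split, also handles a number of rows that does not divide evenly:

>>> import numpy as np
>>> X = np.arange(10).reshape(5, 2)
>>> [a.shape for a in np.array_split(X, 4)]
[(2, 2), (1, 2), (1, 2), (1, 2)]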
Alex
Thanks, that looks good
max_abs = np.concatenate(parallel(my_max_stat(X, X2, p, dof_scaling)
for p in np.array_split(perms, n_jobs)))
This reduces the overhead the most; I didn't understand the meaning
of this before. (I didn't know about np.array_split.)
It needs an intermediate function to do the batching, though.
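As a self-contained sketch of such an intermediate batching function
(a toy permutation test of a mean difference, not the actual max-stat
code):

import numpy as np
from joblib import Parallel, delayed

def batch_stat(pooled, nx, n_perms, seed):
    # one joblib task runs a whole batch of permutations, so the
    # per-task overhead is paid once per batch, not once per permutation
    rng = np.random.RandomState(seed)
    out = np.empty(n_perms)
    for i in range(n_perms):
        p = rng.permutation(pooled)
        out[i] = p[:nx].mean() - p[nx:].mean()
    return out

if __name__ == "__main__":  # guard needed for multiprocessing on Windows
    rng = np.random.RandomState(0)
    x, y = rng.randn(50), rng.randn(60) + 0.3
    pooled = np.concatenate([x, y])
    n_jobs, n_perms = 4, 10000
    null_dist = np.concatenate(Parallel(n_jobs=n_jobs)(
        delayed(batch_stat)(pooled, len(x), n_perms // n_jobs, seed)
        for seed in range(n_jobs)))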
Your use of joblib in the permutation test, and the use in
sklearn.cross_validation were the main reasons for me to look at it.
Still, because of the higher cost of setting up a process on Windows,
the threshold where multiprocessing is an advantage might be higher
than on Linux. (An import of scipy.stats still kills the advantage of
multiprocessing even with n_jobs=4 for the last example, with a
processing time of around 1 second.)
n_jobs=-1 doesn't seem to get it quite right
iterations in Parallel loop: 4
n_jobs   time (s)
     1   9.92100000381
     2   6.11599993706
     3   5.99000000954
     4   3.75999999046
     6   3.97800016403
     8   4.25899982452
    -1   4.16499996185
Thanks,
Josef
I usually use joblib for computations of more than 10 seconds, and
commonly of a few minutes or even hours.
joblib is not worth it if a job takes less than a second.
Alex
I haven't thought about it, but the KF is not embarrassingly parallel
(indeed, it's recursive), so it's not at first obvious to me how to
parallelize it. We could, for example, evaluate the filter and the
gradient at the same time, though I don't know whether each evaluation
is costly enough (I suspect not).
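One way to make "the filter and the gradient at the same time" concrete
is to farm out the independent log-likelihood evaluations of a
finite-difference gradient; a sketch under that assumption (loglike is
a trivial stand-in for the actual KF log-likelihood):

import numpy as np
from joblib import Parallel, delayed

def loglike(theta):
    # stand-in for the (expensive, inherently sequential) Kalman filter
    # log-likelihood evaluation
    return -0.5 * np.sum(theta ** 2)

def fd_gradient(theta, eps=1e-6, n_jobs=-1):
    # the filter itself stays sequential; only the k + 1 independent
    # evaluations needed for forward differences run in parallel
    points = [theta] + [theta + eps * e for e in np.eye(len(theta))]
    vals = Parallel(n_jobs=n_jobs)(delayed(loglike)(p) for p in points)
    return (np.array(vals[1:]) - vals[0]) / eps

if __name__ == "__main__":  # guard needed for multiprocessing on Windows
    print(fd_gradient(np.array([1.0, 2.0, 3.0]), n_jobs=2))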
I get these timings running your script out of the box (and with other
things eating up CPU cycles at the moment).
iterations in Parallel loop: 10000
n_jobs   time (s)
     1   1.70737504959
     2   1.4141061306
     3   1.41899490356
    -1   1.42400503159

iterations in Parallel loop: 1000
n_jobs   time (s)
     1   1.48994278908
     2   0.909826040268
     3   0.611446142197
    -1   0.5164270401

iterations in Parallel loop: 100
n_jobs   time (s)
     1   1.48153614998
     2   0.808979034424
     3   0.610709905624
    -1   0.418374061584
Skipper
I don't think there is much to gain for a KF estimation; I was
thinking more in terms of Monte Carlo or bootstrap, where the KF is in
the loop.
Given also Alex's reply: most of our models and statistical tests are
pretty fast (unless the dataset is large?), so to take advantage of
joblib or multiprocessing we have to construct "bigger" jobs to run,
or create large enough batches.
Just putting joblib over a few estimation or test runs will in many
cases not be time-consuming enough.
Those are my main conclusions from this.
Skipper, can you add `from scipy import stats` at the top of the
script and run again, to see what the difference is?
You are on Linux, I assume.
Thanks,
Josef
Oh, right.
> Given also Alex's reply: most of our models and statistical tests are
> pretty fast (unless the dataset is large?), so to take advantage of
> joblib or multiprocessing we have to construct "bigger" jobs to run,
> or create large enough batches.
> Just putting joblib over a few estimation or test runs will in many
> cases not be time-consuming enough.
>
> Those are my main conclusions from this.
Mine as well.
I am. With the import at the top of the script:
iterations in Parallel loop: 10000
n_jobs   time (s)
     1   1.71545100212
     2   1.42330384254
     3   1.41989302635
    -1   1.51475715637

iterations in Parallel loop: 1000
n_jobs   time (s)
     1   1.50667691231
     2   0.806510925293
     3   0.607214927673
    -1   0.509503126144

iterations in Parallel loop: 100
n_jobs   time (s)
     1   1.47279000282
     2   0.806162118912
     3   0.607878923416
    -1   0.509881973267
It's not clear to me why this would degrade performance. Is it simply
because, on the same machine, imports on Windows seem to take 5 times
as long as on Linux?
Skipper
Robert Kern explained it, along with some other details, on the
scikits.learn mailing list.
Essentially, Linux forks the existing process, while Windows creates a
new process and has to load Python and import all packages again in
each worker process. That's how I understand it.
See https://github.com/statsmodels/statsmodels/issues/72
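A rough way to see the cost each fresh worker pays is to time a cold
import in a new interpreter (a sketch; the numbers will vary by
machine):

import subprocess
import sys
import time

# a fresh interpreter has to redo all imports, which approximates the
# per-worker startup cost on Windows, where processes are not forked
t0 = time.time()
subprocess.call([sys.executable, "-c", "from scipy import stats"])
print("cold start + scipy.stats import: %.2f s" % (time.time() - t0))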
Another reason for me to be allergic to unused, un-lazy imports, and a
big advantage of our `.api` restructuring.
Josef
And sklearn is in the "import life, universe, everything" camp.
>>> import copy, sys
>>> before = copy.copy(sys.modules)
>>> len(before)
132
>>> from sklearn.externals.joblib import Parallel, delayed
>>> after = copy.copy(sys.modules)
>>> len(after)
552
This counts numpy, which we cannot avoid, but also a large part of
scipy and, I guess, most of sklearn.
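Continuing that session, a quick way to see which packages the extra
modules come from (counts elided, since they vary by version):

>>> from collections import Counter
>>> new = set(after) - set(before)
>>> Counter(name.split('.')[0] for name in new).most_common()
[('scipy', ...), ('sklearn', ...), ...]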
If I replace `from sklearn.externals.joblib import Parallel, delayed`
with `from joblib import Parallel, delayed`,
then I get much better numbers on Windows as well:
iterations in Parallel loop: 10000
n_jobs   time (s)
     1   1.20499992371
     2   1.34500002861
     3   1.27999997139
     4   1.38000011444
     6   1.45499992371
     8   1.61500000954
    -1   1.625

iterations in Parallel loop: 1000
n_jobs   time (s)
     1   1.0649998188
     2   0.825000047684
     3   0.65499997139
     4   0.579999923706
     6   0.590000152588
     8   0.644999980927
    -1   0.644999980927

iterations in Parallel loop: 100
n_jobs   time (s)
     1   1.05999994278
     2   0.759999990463
     3   0.630000114441
     4   0.564999818802
     6   0.570000171661
     8   0.629999876022
    -1   0.625

iterations in Parallel loop: 4
n_jobs   time (s)
     1   1.05999994278
     2   0.745000123978
     3   0.77999997139
     4   0.56500005722
     6   0.649999856949
     8   0.735000133514
    -1   0.730000019073
It is also twice as fast in a 4-second loop:
iterations in Parallel loop: 4
n_jobs   time (s)
     1   3.9470000267
     2   2.71399998665
     3   2.7610001564
     4   1.93499994278
     6   2.05900001526
     8   2.3869998455
    -1   2.32400012016
(As usual, my notebook's RAM is nearly full, 6.8 GB of 8 GB used, but
there is not much CPU load, so the numbers could change.)
Josef