Does statsmodels use multiprocessing?


a_b

Dec 11, 2013, 5:56:05 AM
to pystat...@googlegroups.com

Hi,

I am one of those guys making the switch from R to Python, and statsmodels seemed perfect for this. However, I now want to harness the multiprocessing power that Python gives us and would like to know whether the modules in statsmodels are built to handle this (scikit-learn supports it).

Thanks for your time!

josef...@gmail.com

Dec 11, 2013, 9:01:39 AM
to pystatsmodels
On Wed, Dec 11, 2013 at 5:56 AM, a_b <ajit....@gmail.com> wrote:

Hi,

I am one of those guys making the switch from R to Python, and statsmodels seemed perfect for this. However, I now want to harness the multiprocessing power that Python gives us and would like to know whether the modules in statsmodels are built to handle this (scikit-learn supports it).

Yes and no,

We have a helper function to make it easier to use joblib (copied from one of the scikit-learn authors, from MNE-Python).
We use it in the nonparametric code; I thought also in tsa.VAR, but a quick search didn't find it.

However, with a few exceptions, statsmodels doesn't have many candidates for internal multiprocessing: we don't have many algorithms that can be parallelized and that take long enough to be worth the multiprocessing overhead.

Currently we don't have many places where we do cross-validation or bootstrapping that would require running the same model many times.
If users want to estimate many models, for example, then it's better to pool cases so each process has a large enough amount of work to do.

That's largely based on some experiments I did with joblib some time ago. 
If anyone sees additional cases where we can take advantage of joblib, then opening an issue or pull request to improve this would be helpful.
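
For illustration, pooling with joblib could look roughly like this (a minimal sketch with made-up specifications and a fit_batch helper; this is not a statsmodels API):

# Each process gets a whole batch of model specifications, so the
# per-process overhead is amortized over enough work.
import numpy as np
from joblib import Parallel, delayed
import statsmodels.api as sm

def fit_batch(spec_batch, y, x):
    return [sm.OLS(y, x[:, cols]).fit().params for cols in spec_batch]

if __name__ == '__main__':
    rng = np.random.RandomState(0)
    x = rng.randn(10000, 10)
    y = x[:, :3].sum(1) + rng.randn(10000)
    specs = [list(range(k)) for k in range(2, 10)]       # candidate column sets
    batches = [specs[i::4] for i in range(4)]            # pool specs into 4 batches
    results = Parallel(n_jobs=4)(delayed(fit_batch)(b, y, x) for b in batches)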

Josef
 

Thanks for your time!

Damien Moore

Dec 12, 2013, 3:36:33 PM
to pystat...@googlegroups.com
I imagine tsa.VAR would benefit some, especially with the model selection stuff.

I made a little test of joblib, based on the MNLogit example, that re-runs MNLogit on various permutations of the original model: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/joblib-mnlogit-example.py

Here's the output from an AMD Vishera 8320 with 8 cores (but only 4 FPUs) running Ubuntu 13.10:

# Jobs  time(s)  Speedup multiple vs 1 job
1       8.65     1.00
2       4.43     1.95
3       3.13     2.77
4       2.43     3.56
5       2.13     4.06
6       1.83     4.72
7       1.63     5.30
8       1.53     5.65
9       1.54     5.61
10      1.54     5.63

As you can see, you can get quite a bit of speedup if you have enough work to do. One of the bottlenecks is making sure there is enough work relative to the memory transfer overhead.

Strangely, this doesn't work at all on a different machine running Windows 7 (i.e. more jobs result in slower execution).

Does joblib have any way of sharing memory? It seems really inefficient to have n copies of the data set held in memory, especially with very big datasets.


josef...@gmail.com

Dec 12, 2013, 3:54:14 PM
to pystatsmodels
On Thu, Dec 12, 2013 at 3:36 PM, Damien Moore <damien...@gmail.com> wrote:
I imagine tsa.VAR would benefit some, especially with the model selection stuff.

I made a little test of joblib, based on the MNLogit example, that re-runs MNLogit on various permutations of the original model: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/joblib-mnlogit-example.py

Here's the output from an AMD Vishera 8320 with 8 cores (but only 4 FPUs) running Ubuntu 13.10:

# Jobs  time(s)  Speedup multiple vs 1 job
1       8.65     1.00
2       4.43     1.95
3       3.13     2.77
4       2.43     3.56
5       2.13     4.06
6       1.83     4.72
7       1.63     5.30
8       1.53     5.65
9       1.54     5.61
10      1.54     5.63

As you can see, you can get quite a bit of speedup if you have enough work to do. One of the bottlenecks is making sure there is enough work relative to the memory transfer overhead.

Thanks for looking into this.
 

Strangely, this doesn't work at all on a different machine running Windows 7 (i.e. more jobs result in slower execution).

The problem is the imports, since Windows doesn't "fork".

What might help a bit is not to use statsmodels.api but to import directly from the module.
However, this currently still imports scipy and pandas.
Does importing pandas still import matplotlib? I haven't checked in a while.
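
For the first point, something along these lines (the module path is correct in current statsmodels, but worth double-checking against your installed version):

# Import the estimator from its own module instead of the statsmodels.api
# convenience namespace, which pulls in the whole package on every process start.
from statsmodels.discrete.discrete_model import MNLogit

# instead of
# import statsmodels.api as sm
# model = sm.MNLogit(endog, exog)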
 

Does joblib have any way of sharing memory? It seems really inefficient to have n copies of the data set held in memory, especially with very big datasets.

I think there is a way to use memory maps, but I never looked into it.

It wouldn't make much difference for random permutations, which can be transformed in place.
But it can make a big difference for a parametric bootstrap, or a bootstrap conditional on the exog/design matrix, where we can just reuse the same array without changing or copying it.
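
A rough sketch of the memory-map idea, using numpy.memmap directly rather than anything joblib-specific (the file name and the trivial resampling are just placeholders):

# Each worker maps the same read-only exog file instead of holding its own copy.
import numpy as np
from joblib import Parallel, delayed
import statsmodels.api as sm

def fit_resample(path, shape, y, seed):
    exog = np.memmap(path, dtype='float64', mode='r', shape=shape)  # no copy
    rng = np.random.RandomState(seed)
    y_star = y + rng.randn(len(y))          # redraw only the response
    return sm.OLS(y_star, exog).fit().params

if __name__ == '__main__':
    rng = np.random.RandomState(0)
    x = rng.randn(100000, 5)
    y = x.sum(1)
    mm = np.memmap('exog.dat', dtype='float64', mode='w+', shape=x.shape)
    mm[:] = x
    mm.flush()
    out = Parallel(n_jobs=4)(
        delayed(fit_resample)('exog.dat', x.shape, y, s) for s in range(50))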

Josef

josef...@gmail.com

Dec 12, 2013, 4:04:37 PM
to pystatsmodels
On Thu, Dec 12, 2013 at 3:54 PM, <josef...@gmail.com> wrote:



On Thu, Dec 12, 2013 at 3:36 PM, Damien Moore <damien...@gmail.com> wrote:
I imagine tsa.VAR would benefit some, especially with the model selection stuff.

I made a little test of joblib, based on the MNLogit example, that re-runs MNLogit on various permutations of the original model: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/joblib-mnlogit-example.py

Here's the output from an AMD Vishera 8320 with 8 cores (but only 4 FPUs) running Ubuntu 13.10:

# Jobs  time(s)  Speedup multiple vs 1 job
1       8.65     1.00
2       4.43     1.95
3       3.13     2.77
4       2.43     3.56
5       2.13     4.06
6       1.83     4.72
7       1.63     5.30
8       1.53     5.65
9       1.54     5.61
10      1.54     5.63

As you can see, you can get quite a bit of speedup if you have enough work to do. One of the bottlenecks is making sure there is enough work relative to the memory transfer overhead.

Thanks for looking into this.
 

Strangely, this doesn't work at all on a different machine running Windows 7 (i.e. more jobs result in slower execution).

The problem is the imports, since Windows doesn't "fork".

On Ubuntu, however, this is a good speedup, so maybe we should start using it sooner than the Windows results would suggest.

 

What might help a bit is not to use statsmodels.api but to import directly from the module.
However, this currently still imports scipy and pandas.
Does importing pandas still import matplotlib? I haven't checked in a while.

Another possible improvement (I guess) is to move the `#Get the data` code into the `if __name__ == '__main__'` part.
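
i.e. roughly this pattern (a sketch, not the linked script; the anes96 columns are just examples):

# On Windows there is no fork: joblib re-imports this script in every worker,
# so keep data loading and job dispatch behind the __main__ guard.
from joblib import Parallel, delayed
import statsmodels.api as sm

def run_one(cols, data):
    return data[cols].mean()    # stand-in for fitting one MNLogit permutation

if __name__ == '__main__':
    data = sm.datasets.anes96.load_pandas().data     # the "#Get the data" part
    specs = [['age', 'educ'], ['age', 'income'], ['educ', 'income']]
    results = Parallel(n_jobs=2)(delayed(run_one)(c, data) for c in specs)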

Josef

Damien Moore

Dec 12, 2013, 5:03:33 PM
to pystat...@googlegroups.com

Strangely, this doesn't work at all on a different machine running Windows 7 (i.e. more jobs result in slower execution).

The problem is the imports, since Windows doesn't "fork".

Ahh, that makes sense. 

 
On Ubuntu, however, this is a good speedup, so maybe we should start using it sooner than the Windows results would suggest.


The multicore performance is how I justified my recent purchase of an AMD CPU over an otherwise far superior Intel CPU. I mostly plan to use it for econ/finance-related Monte Carlo work, which hopefully should scale even more nicely.

 
 

What might help a bit is not to use statsmodels.api but to import directly from the module.
However, this currently still imports scipy and pandas.

Wouldn't Windows DLL caching still make the import pretty quick anyway? (This is obviously irrelevant on Linux.) Maybe just not quick enough relative to the short time each job takes. I tried increasing the workload by running fit 10 times and still got a slowdown instead of a speedup with higher numbers of jobs.
 
Does importing pandas still import matplotlib? I haven't checked in a while.

Another possible improvement (I guess) is to move the `#Get the data` code into the `if __name__ == '__main__'` part.

I had already tried this (just forgot to commit) and it didn't make much difference (on Windows; obviously none at all on Linux).
 

Damien Moore

Dec 12, 2013, 5:47:31 PM
to pystat...@googlegroups.com
It wouldn't make much difference for random permutations, which can be transformed in place.
But it can make a big difference for a parametric bootstrap, or a bootstrap conditional on the exog/design matrix, where we can just reuse the same array without changing or copying it.

Re memory... I initially started playing around with a 1-million-observation Probit/Logit, running different nested specifications as separate jobs. Memory use escalates quite a bit with the number of jobs: a single-job run uses around 3.5GB, whereas an 8-job run used about 13.5GB. I can probably get memory use down a bit because the dataframe included a lot of unused variables, but it's nonetheless informative. Speed improvements were comparable, but because I didn't have a lot of jobs, it was harder to get a balanced load.



josef...@gmail.com

Dec 12, 2013, 6:45:48 PM
to pystatsmodels
On Thu, Dec 12, 2013 at 5:03 PM, Damien Moore <damien...@gmail.com> wrote:

Strangely, this doesn't work at all on a different machine running Windows 7 (i.e. more jobs result in slower execution).

The problem is the imports, since Windows doesn't "fork".

Ahh, that makes sense. 

 
On Ubuntu, however, this is a good speedup, so maybe we should start using it sooner than the Windows results would suggest.


The multicore performance is how I justified my recent purchase of an AMD CPU over an otherwise far superior Intel CPU. I mostly plan to use it for econ/finance-related Monte Carlo work, which hopefully should scale even more nicely.

 
 

What might help a bit is not to use statsmodels.api but to import directly from the module.
However, this currently still imports scipy and pandas.

Wouldn't Windows DLL caching still make the import pretty quick anyway? (This is obviously irrelevant on Linux.) Maybe just not quick enough relative to the short time each job takes. I tried increasing the workload by running fit 10 times and still got a slowdown instead of a speedup with higher numbers of jobs.


DLL caching makes a difference. I know that when I cold-start a new Python session it takes "forever" to get the ready prompt. Later, at least, I don't see a large difference anymore when I run a Python script.

As I said, I only "played" with this for one or two afternoons, for permutation tests.

With scipy being loaded, a job needed to take about 10 seconds to be worth it. Without importing scipy it was less than half that.

With your script and small changes, I get this on a quad-core notebook with Windows 7 32-bit Python (8 virtual cores):

# Jobs  time(s) Speedup multiple vs 1 job
1       3.25    1.00
2       2.69    1.21
3       2.21    1.47
4       2.11    1.54
5       2.12    1.53
6       2.63    1.24
7       2.08    1.57
8       2.32    1.40
9       2.45    1.32
10      2.65    1.23

(I got an exception with p+ovars and just used p in the slice.)

Also, as far as I understand your script (it has been a while), the `delayed` part creates a new process for each permutation. Pooling permutations would reduce the number of processes that need to be created. IIRC, I saw a speedup from this before.
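
Pooling could look roughly like this (a sketch; fit_chunk stands in for the per-permutation MNLogit fit in your script):

# One chunk of permutations per job, so each process is created once and
# then works through many permutations.
from joblib import Parallel, delayed

def fit_chunk(perm_chunk):
    return [sum(p) for p in perm_chunk]      # placeholder for MNLogit(...).fit()

if __name__ == '__main__':
    perms = [(i, i + 1) for i in range(1000)]            # all permutations
    n_jobs = 4
    chunks = [perms[i::n_jobs] for i in range(n_jobs)]   # one chunk per job
    results = Parallel(n_jobs=n_jobs)(delayed(fit_chunk)(c) for c in chunks)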

That's what I get when I run each fit() in `reg` 20 times:

# Jobs  time(s)  Speedup multiple vs 1 job
1       58.01    1.00
2       40.60    1.43
3       31.83    1.82
4       27.75    2.09
5       24.41    2.38
6       22.53    2.57
7       21.76    2.67
8       20.04    2.89
9       19.40    2.99
10      19.83    2.93

Starting with 7 jobs, my CPU is at 100%.

Josef

Damien Moore

Dec 12, 2013, 9:42:32 PM
to pystat...@googlegroups.com
10      2.65    1.23

(I got an exception with p+ovars and just used p in the slice)

On Python 3, I think ovars is the wrong type. (Without the ovars, there's about half as much work overall.)
 
Also, as far as I understand your script (it has been a while), the `delayed` part creates a new process for each permutation. Pooling permutations would reduce the number of processes that need to be created. IIRC, I saw a speedup from this before.

Yes, but the extra processes don't cost much on Linux, and the advantage is that joblib makes sure a new job is spawned after an old one finishes, keeping the maximum number of cores busy. With lots of small, unevenly sized tasks this makes a lot of sense. Anyway, I would have thought the joblib devs would have a workaround for the lack of fork on Windows. For example, they could spawn up to n_jobs processes and, instead of letting them die, use IPC to send the next task. Maybe I just need to read the docs more carefully.


 

That's what I get when I run each fit() in `reg` 20 times:

# Jobs  time(s)  Speedup multiple vs 1 job
1       58.01    1.00
2       40.60    1.43
3       31.83    1.82
4       27.75    2.09
5       24.41    2.38
6       22.53    2.57
7       21.76    2.67
8       20.04    2.89
9       19.40    2.99
10      19.83    2.93

Starting with 7 jobs, my CPU is at 100%.

Is this a hyperthreaded Intel CPU, i.e. 4 real cores with 8 virtual cores (2 "threads" per core)?

josef...@gmail.com

Dec 12, 2013, 10:02:39 PM
to pystatsmodels
On Thu, Dec 12, 2013 at 9:42 PM, Damien Moore <damien...@gmail.com> wrote:
10      2.65    1.23

(I got an exception with p+ovars and just used p in the slice)

On Python 3, I think ovars is the wrong type. (Without the ovars, there's about half as much work overall.)

What is this supposed to do? I got an invalid index, index too large, IIRC.
exog has 6 columns and ovars is much larger.
I used Python 2.7.1 for this.
 
 
Also, as far as I understand your script (it has been a while), the `delayed` part creates a new process for each permutation. Pooling permutations would reduce the number of processes that need to be created. IIRC, I saw a speedup from this before.

Yes, but the extra processes don't cost much on Linux, and the advantage is that joblib makes sure a new job is spawned after an old one finishes, keeping the maximum number of cores busy. With lots of small, unevenly sized tasks this makes a lot of sense. Anyway, I would have thought the joblib devs would have a workaround for the lack of fork on Windows. For example, they could spawn up to n_jobs processes and, instead of letting them die, use IPC to send the next task. Maybe I just need to read the docs more carefully.

For Monte Carlo or bootstrap it might be possible to spread the load quite evenly, plus it averages out if you need to run a few hundred estimation problems per process.

If you find something, then please report back.
I'm a bit pessimistic because it's not needed on Linux, and the developers are exclusively Unix/Linux users AFAIK.

 


 

That's what I get when I run each fit() in `reg` 20 times:

# Jobs  time(s)  Speedup multiple vs 1 job
1       58.01    1.00
2       40.60    1.43
3       31.83    1.82
4       27.75    2.09
5       24.41    2.38
6       22.53    2.57
7       21.76    2.67
8       20.04    2.89
9       19.40    2.99
10      19.83    2.93

Starting with 7 jobs, my CPU is at 100%.

Is this a hyperthreaded Intel CPU, i.e. 4 real cores with 8 virtual cores (2 "threads" per core)?

Yes, an i7 in a notebook. I'm low on RAM and 7 of my 8GB are in use by some hungry Firefox and Chrome (once the number of open tabs gets laaarge), but that wasn't a problem with your example script.

Josef

Damien Moore

Dec 13, 2013, 3:33:44 PM
to pystat...@googlegroups.com
What is this supposed to do? I got an invalid index, index too large, IIRC.
exog has 6 columns and ovars is much larger.
I used Python 2.7.1 for this.

I was using an older version of statsmodels (0.4.3 on Windows), which had a 9-column exog set for ANES. If I run your version of the code with 5 columns, my speed on Linux goes from 4.0s (1 "job") to 0.84s (8 "jobs"), so roughly the same speedup as before.

Re my Windows issues, I finally figured out that because I was using the MKL version of numpy that Gohlke provides, I was already getting a multicore speed bump. Running the same set of regs without any joblib shows 90% CPU usage in Windows Task Manager. Adding a second job basically maxed things out completely, with a modest 30% speed bump over the run without joblib. Oddly, max processor usage is more like 50% if I do something involving bigger datasets (maybe the MKL speed-up is only in cache). The Windows machine is a quad-core Intel Xeon 5570 (hyperthreaded) running 64-bit Python.

I gather you weren't running MKL-enabled numpy when you got the 3x speedup you show above?

josef...@gmail.com

Dec 13, 2013, 4:11:59 PM
to pystatsmodels
That's good to hear. I haven't checked yet whether the MKL versions are multithreaded. I know that in GAUSS the speedup from multicore math libraries (LAPACK, I guess) was pretty good.

I have several 64-bit Pythons with Gohlke's libraries installed, but most of the time I use a boring 32-bit version with official numpy and scipy, because then it's easy to compile extensions with MinGW.

With larger datasets shuffling the data around might "starve" the CPU. One of the next tasks for statsmodels is to play with a memory profiler.

I think we largely avoid making copies of arrays, but we also store quite a few lazily cached arrays, depending on what you access.
One possible source of inefficient looping is if the convenient pandas and patsy interfaces repeatedly do the same data handling.

Are you doing cross-validation for the selection of explanatory variables, or Monte Carlo for a given set of explanatory variables?

I started to think about whether and where we could introduce some backdoors or shortcuts to make cross-validation or bootstrap loops more efficient, but I haven't gotten very far yet.

Josef

Damien Moore

Dec 13, 2013, 6:22:23 PM
to pystat...@googlegroups.com

Actually, it looks like the multithreaded part of MKL numpy is doing nothing good for performance on this Windows machine.

With the default settings:

set MKL_NUM_THREADS=4
set MKL_DYNAMIC=TRUE
python reg-example.py
Time 52.20s

Now crippling the multithreading:

set MKL_NUM_THREADS=1
set MKL_DYNAMIC=FALSE
python reg-example.py
Time 32.99s

The time being measured is how long it takes to estimate the permutations of ANES MNLogits (with fit repeated 10 times to create enough work) without using joblib.

If I then use joblib/multiprocessing with MKL_NUM_THREADS still set to 1, I get ~3x speedup.
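
The same restriction can also be made from inside the script, as long as it happens before numpy (and hence MKL) is first imported (a sketch; whether it takes effect depends on the build):

import os
os.environ['MKL_NUM_THREADS'] = '1'     # keep MKL single-threaded per process
os.environ['MKL_DYNAMIC'] = 'FALSE'

import numpy as np                      # imported only after the env is set
from joblib import Parallel, delayed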

josef...@gmail.com

Dec 20, 2013, 9:00:16 AM
to pystatsmodels
On Fri, Dec 13, 2013 at 6:22 PM, Damien Moore <damien...@gmail.com> wrote:

Actually, it looks like the multithreaded part of MKL numpy is doing nothing good for performance on this Windows machine.

With the default settings:

set MKL_NUM_THREADS=4
set MKL_DYNAMIC=TRUE
python reg-example.py
Time 52.20s

Now crippling the multithreading:

set MKL_NUM_THREADS=1
set MKL_DYNAMIC=FALSE
python reg-example.py
Time 32.99s

The time being measured is how long it takes to estimate the permutations of ANES MNLogits (with fit repeated 10 times to create enough work) without using joblib.

If I then use joblib/multiprocessing with MKL_NUM_THREADS still set to 1, I get ~3x speedup.


I think MNLogit is not a good candidate for multithreading in the math libraries. There is almost no linear algebra, and the arrays in the example are too small to benefit even from a vectorized dot.
I expect that it would be more useful for large linear models, where the linear algebra (pinv/SVD or QR, and dot products) takes most of the time.
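
For a rough idea of that kind of workload (assuming an MKL-linked numpy; the sizes are arbitrary):

# In a large OLS, pinv/SVD and the dot products dominate the runtime,
# and those are what a multithreaded BLAS/LAPACK can parallelize internally.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(200000, 200)
y = x.dot(rng.randn(200)) + rng.randn(200000)
params = np.linalg.pinv(x).dot(y)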

In case someone didn't see it, there is a related discussion of multiprocessing and leastsq on the scipy-user mailing list.

Josef