Hi,
I am one of those guys making the switch from R to Python and statsmodels seemed perfect for this. However, now I want to harness the power of multiprocessing Python gives us and would like to know if the modules in statsmodels are built to handle this (scikit-learn supports it).
Thanks for your time!
I imagine tsa.VAR would benefit some, especially with the model selection stuff.

I made a little test of joblib based on the MNLogit example, that re-runs MNLogit on various permutations of the original model: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/joblib-mnlogit-example.py

Here's the output from running on an AMD 64 Vishera 8320 with 8 cores (but only 4 FPUs) running Ubuntu 13.10:

# Jobs  time(s)  Speedup multiple vs 1 job
1       8.65     1.00
2       4.43     1.95
3       3.13     2.77
4       2.43     3.56
5       2.13     4.06
6       1.83     4.72
7       1.63     5.30
8       1.53     5.65
9       1.54     5.61
10      1.54     5.63

As you can see, you can get quite a bit of speedup if you have enough work to do. One of the bottlenecks is making sure there is enough work relative to the memory transfer.
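For reference, the pattern in the linked script is roughly the following. This is a minimal sketch rather than the actual file; the data columns, subsets, and n_jobs value are only illustrative:

```python
# Minimal sketch (not the linked script itself): fit MNLogit on the ANES96
# data for several subsets of the explanatory variables in parallel.
import itertools

import statsmodels.api as sm
from joblib import Parallel, delayed

data = sm.datasets.anes96.load_pandas().data
endog = data["PID"]
exog = sm.add_constant(data[["logpopul", "age", "educ", "income"]])

def fit_subset(cols):
    # Fit the multinomial logit on one subset/permutation of the regressors.
    res = sm.MNLogit(endog, exog[list(cols)]).fit(disp=0)
    return cols, res.llf

if __name__ == "__main__":
    subsets = [c for r in (2, 3) for c in itertools.combinations(exog.columns, r)]
    results = Parallel(n_jobs=8)(delayed(fit_subset)(c) for c in subsets)
    for cols, llf in results:
        print(cols, llf)
```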
Strangely this doesn't work at all on a different machine running Windows 7 (i.e. more jobs results in slower execution).
Does joblib have any way of sharing memory? It seems really inefficient to have n copies of the data set held in memory, especially with very big datasets.
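One workaround, independent of whatever joblib provides for this, is to dump the big array to disk once and have each worker open it as a read-only memmap instead of receiving a pickled copy. A rough sketch (the file location and function names are illustrative, not from the example script):

```python
# Rough sketch: share a large read-only array across workers via a memmap
# instead of pickling a full copy into each process.
import os
import tempfile

import numpy as np
import joblib
from joblib import Parallel, delayed

def column_means(path, cols):
    # Each worker maps the dumped array read-only; no full in-memory copy.
    X = joblib.load(path, mmap_mode="r")
    return np.asarray(X[:, cols]).mean(axis=0)

if __name__ == "__main__":
    big = np.random.RandomState(0).standard_normal((200000, 50))
    path = os.path.join(tempfile.mkdtemp(), "exog.joblib")
    joblib.dump(big, path)
    out = Parallel(n_jobs=4)(
        delayed(column_means)(path, cols) for cols in ([0, 1], [2, 3], [4, 5])
    )
    print(out)
```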
On Thu, Dec 12, 2013 at 3:36 PM, Damien Moore <damien...@gmail.com> wrote:
> I made a little test of joblib based on the MNLogit example, that re-runs MNLogit on various permutations of the original model. [...]

Thanks for looking into this.

> Strangely this doesn't work at all on a different machine running Windows 7 (i.e. more jobs results in slower execution).

The problem is the imports, since Windows doesn't "fork".
What might help a bit is not to use statsmodels.api but to import from the module. However, this currently still imports scipy and pandas. Does importing pandas still import matplotlib? I haven't checked in a while.
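For what it's worth, the difference is just this (a sketch; how much it actually saves depends on how much of the import chain is avoided):

```python
# Heavier: pulls in the whole statsmodels.api namespace in every worker.
import statsmodels.api as sm

# Lighter: import just the model class from its module. As noted above,
# this currently still drags in scipy and pandas, so the saving is limited.
from statsmodels.discrete.discrete_model import MNLogit
```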
On Ubuntu, however, this is a good speedup, so maybe we should start with it sooner than what I see on Windows.
Another possible improvement (I guess) is to move the `#Get the data` code into the `if __name__ == "__main__"` part.
It wouldn't make much difference for random permutations, which can be transformed in place. But it can make a big difference for a parametric bootstrap, or a bootstrap conditional on the exog/design matrix, where we can just reuse the same array without changing or copying it.
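Roughly what that rearrangement looks like (a sketch with illustrative data and columns, not the actual example): on Windows the spawned workers re-import the module, so module-level code runs once per worker, while code under the `if __name__ == "__main__"` guard runs only in the parent.

```python
# Sketch of the suggested layout: keep the "#Get the data" step under the
# __main__ guard so spawned Windows workers, which re-import this module,
# don't each redo it at import time.
import statsmodels.api as sm
from joblib import Parallel, delayed

def fit_once(endog, exog):
    return sm.MNLogit(endog, exog).fit(disp=0).llf

if __name__ == "__main__":
    # Get the data -- runs only in the parent; joblib pickles endog/exog
    # and ships them to each worker.
    anes = sm.datasets.anes96.load_pandas().data
    endog = anes["PID"]
    exog = sm.add_constant(anes[["logpopul", "age", "educ", "income"]])
    print(Parallel(n_jobs=2)(delayed(fit_once)(endog, exog) for _ in range(4)))
```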
> > Strangely this doesn't work at all on a different machine running Windows 7 (i.e. more jobs results in slower execution).
>
> The problem is the imports, since Windows doesn't "fork".

Ahh, that makes sense.

> On Ubuntu, however, this is a good speedup, so maybe we should start with it sooner than what I see on Windows.

The multicore performance is how I justified my recent purchase of an AMD CPU over an otherwise far superior Intel CPU. I mostly plan to use it for econ/finance related Monte Carlo stuff that hopefully should scale even more nicely.

> What might help a bit is not to use statsmodels.api but to import from the module. However, this currently still imports scipy and pandas.

Wouldn't Windows DLL caching still make the import pretty quick anyway? (This is obviously irrelevant on Linux.) Maybe just not quick enough relative to the relatively short time each job takes. I tried increasing the workload by running fit 10 times and still managed to get a slowdown instead of a speedup with higher numbers of jobs.
# Jobs  time(s)  Speedup multiple vs 1 job
10      2.65     1.23

(I got an exception with p+ovars and just used p in the slice)
Also, as far as I understand your script (it has been a while), the `delayed` part creates a new process for each permutation. Pooling permutations would reduce the number of processes that need to be created. IIRC, I saw a speedup in this case before.
That's what I get when I run each fit() in `reg` 20 times:

# Jobs  time(s)  Speedup multiple vs 1 job
1       58.01    1.00
2       40.60    1.43
3       31.83    1.82
4       27.75    2.09
5       24.41    2.38
6       22.53    2.57
7       21.76    2.67
8       20.04    2.89
9       19.40    2.99
10      19.83    2.93
Starting with 7 jobs my CPU is at 100%.
> 10 2.65 1.23
>
> (I got an exception with p+ovars and just used p in the slice)

On Python 3, I think ovars is the wrong type. (Without the ovars, there's about half as much work overall.)
> Also, as far as I understand your script (it has been a while), the `delayed` part creates a new process for each permutation. Pooling permutations would reduce the number of processes that need to be created. IIRC, I saw a speedup in this case before.

Yes, but the extra processes don't have much of a cost on Linux, and the advantage is that joblib makes sure that a new job is spawned after an old one finishes, keeping the maximum number of cores busy. With lots of small, unevenly sized tasks this makes a lot of sense. Anyway, I would have thought the joblib devs would have a workaround for the lack of fork on Windows. For example, they could spawn up to n_jobs processes and, instead of letting them die, use IPC to send the next task. Maybe I just need to read the docs more carefully.
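For reference, the pooled variant would look something like the sketch below, with a cheap stand-in for the actual model fit (`fit_one`, `fit_chunk`, and `chunked` are made-up names): each worker gets one chunk of tasks instead of one tiny task at a time, trading dispatch and pickling overhead against the finer-grained load balancing described above.

```python
# Sketch of pooling/chunking tasks: one delayed call per chunk instead of
# one per permutation. fit_one is a placeholder for the real model fit.
from joblib import Parallel, delayed

def fit_one(task):
    # Placeholder for fitting one permuted model.
    return sum(task)

def fit_chunk(chunk):
    return [fit_one(t) for t in chunk]

def chunked(seq, n_chunks):
    # Round-robin split into roughly equal-sized chunks.
    return [seq[i::n_chunks] for i in range(n_chunks)]

if __name__ == "__main__":
    tasks = [(i, i + 1) for i in range(1000)]
    n_jobs = 4
    nested = Parallel(n_jobs=n_jobs)(
        delayed(fit_chunk)(c) for c in chunked(tasks, n_jobs)
    )
    flat = [result for chunk in nested for result in chunk]
    print(len(flat))
```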
> Starting with 7 jobs my CPU is at 100%.

Is this a hyperthreaded Intel CPU, i.e. 4 real cores with 8 virtual cores (2 "threads" per core)?
What is this supposed to do? I got an invalid index ("index too large", IIRC); exog has 6 columns and ovars is much larger. I used Python 2.7.1 for this.
Actually, it looks like the multithreaded part of MKL numpy is doing nothing good for performance on this Windows machine.
With the default settings:
set MKL_NUM_THREADS=4
set MKL_DYNAMIC=TRUE
python reg-example.py
Time 52.20s
Now crippling the multithreading:
set MKL_NUM_THREADS=1
set MKL_DYNAMIC=FALSE
python reg-example.py
Time 32.99s
The time being measured is how long it takes to estimate the permutations of ANES MNLogits (with fit repeated 10 times to create enough work) without using joblib.
If I then use joblib/multiprocessing with MKL_NUM_THREADS still set to 1, I get ~3x speedup.
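For completeness, the same pinning can be done from inside the script, as long as it happens before numpy is imported; this mirrors the `set ...` shell commands above and is a sketch, not part of the original example:

```python
# Minimal sketch: disable MKL's internal threading before numpy is imported,
# so the joblib worker processes don't oversubscribe the cores. Equivalent to
# the `set MKL_NUM_THREADS=1` / `set MKL_DYNAMIC=FALSE` shell commands above.
import os

os.environ["MKL_NUM_THREADS"] = "1"
os.environ["MKL_DYNAMIC"] = "FALSE"

import numpy as np  # must come after the environment variables are set
```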