Pandas and Multiprocessing

Ryan Turner

unread,
Jun 23, 2015, 11:28:16 AM
to pyd...@googlegroups.com
Hi, I have a DataFrame and I would like to share it between threads (multiprocessing.Pool).  I only read the DataFrame, so synchronization isn't an issue.

I currently send a complete copy of the DataFrame to each thread, which works, but it makes starting the threads very slow and wastes an enormous amount of memory.

Any time I search for solutions to this problem, people suggest relying on the copy-on-write behavior of fork() on Linux, but I'm on Windows, so I need a more explicit solution.

rtem...@gmail.com

unread,
Jun 23, 2015, 11:34:21 AM
to pyd...@googlegroups.com
Try using blaze or dask and asking in the blaze user group: https://groups.google.com/a/continuum.io/forum/#!forum/blaze-dev 

Chris Withers

unread,
Jun 23, 2015, 7:45:42 PM
to pyd...@googlegroups.com
I presume the OP means processes, not threads?

How does Blaze or Dask help them?

Sounds like he's more looking to put a DataFrame in some shared memory. How do you do that?

cheers,

Chris

Todd

unread,
Jun 24, 2015, 12:49:08 AM
to pyd...@googlegroups.com

It depends on exactly what you want to do with the DataFrame. If you only need part of it in each process, you can divide it up and send just one part of it at a time (like with groupby or iterrows).
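For example, here is a rough sketch of that idea (the 'key' and 'value' columns and the aggregation are made up); each worker process only ever receives its own slice of the frame:

import pandas as pd
from multiprocessing import Pool

def summarize(chunk):
    # the worker gets just this piece, not the whole DataFrame
    return chunk['value'].sum()

if __name__ == '__main__':  # guard is required for multiprocessing on Windows
    df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                       'value': [1.0, 2.0, 3.0, 4.0]})
    # one piece per group; iterrows() would give one piece per row instead
    pieces = [group for _, group in df.groupby('key')]

    pool = Pool(processes=2)
    results = pool.map(summarize, pieces)
    pool.close()
    pool.join()
    print(results)  # one aggregate per group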

If that isn't the case, there isn't much you can do. The Windows multiprocessing capabilities are very different from those of pretty much any other modern operating system: there is no fork(), so each worker is spawned as a fresh interpreter and receives a pickled copy of its arguments, and that is the cost you are running into.

Richard Stanton

unread,
Jun 24, 2015, 12:39:13 PM
to pyd...@googlegroups.com

You can share data if you use multiple threads rather than processes. However, then you get bitten by Python's Global Interpreter Lock (GIL), which means that only one thread can execute Python bytecode at a time, pretty much eliminating the advantage of using multiple threads in the first place (at least for CPU-bound work)! You can, however, get around the GIL by writing multithreaded code in Cython and explicitly releasing the GIL around the sections you want to run in parallel.


Here’s a brief, very readable presentation by Francesc Alted that talks about this and other ways around the GIL:


https://python.g-node.org/python-summerschool-2011/_media/materials/parallel/parallelcython.pdf 


Best,


Richard Stanton

Ryan Turner

unread,
Jun 24, 2015, 1:00:53 PM
to pyd...@googlegroups.com
My data fits in memory, so I'd like to avoid having to learn another library.  I also rely on the functionality Pandas provides.

Are we just screwed on Windows?

John Readey

unread,
Jun 24, 2015, 1:00:54 PM
to pyd...@googlegroups.com
This is an interesting approach to parallelism.  Is there a list of Pandas routines that take advantage of multi-core machines by using Cython multithreading?

John

Stephan Hoyer

unread,
Jun 24, 2015, 1:24:31 PM
to pyd...@googlegroups.com
Jeff Reback has recently done some work to release the GIL in pandas' internal routines written in Cython. This should make it in the next release of pandas (0.17): https://github.com/pydata/pandas/pull/10199

You'll still need to use your own multithreading pool to take advantage of this. But it will make multithreading feasible (instead of only multiprocessing).
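As a rough sketch of that pattern (the columns, sizes, and pool size here are made up), assuming a pandas version whose groupby code releases the GIL:

import numpy as np
import pandas as pd
from multiprocessing.pool import ThreadPool  # threads share the DataFrame, no copies

n = 1000000
df = pd.DataFrame({'key': np.random.randint(0, 1000, n),
                   'a': np.random.randn(n),
                   'b': np.random.randn(n)})

def agg(col):
    # groupby reductions can drop the GIL (pandas >= 0.17),
    # so several of these calls can make progress on different cores
    return col, df.groupby('key')[col].mean()

pool = ThreadPool(2)
results = dict(pool.map(agg, ['a', 'b']))
pool.close()
pool.join()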




Jeff

unread,
Jun 24, 2015, 1:28:27 PM
to pyd...@googlegroups.com
I always recommend turning to parallel code only as a last resort.

Always use idiomatic pandas first: builtins, iterators, and vectorized operations.

Assuming you have done this, see this PR: https://github.com/pydata/pandas/pull/10199,
which will shortly be merged for 0.17.0.

This will release the GIL on groupby and lots of other operations that are worth running multi-threaded.

In conjunction with dask, see here: https://dask.readthedocs.org/

This can be a powerful way to easily use pandas out-of-core / threaded / multi-processing / distributed.
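A minimal sketch of that route (the column names are made up), assuming the frame already lives in memory:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

# split the in-memory frame into partitions that dask can schedule
# across threads (the default), processes, or a distributed cluster
ddf = dd.from_pandas(df, npartitions=2)

result = ddf.groupby('key')['value'].mean().compute()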

We are discussing, at SciPy, how (and if) to make this API accessible directly from pandas.