pandas and multiprocessing?


Richard Stanton

Apr 5, 2013, 1:13:55 PM
to pyd...@googlegroups.com
In my current application, I'm working with some largish (up to, say, 6GB)
data sets, which I read into a pandas DataFrame. I then want to run some
estimation code, where processing time is an issue. To reduce processing
time, I'd like to use multiple processes (using the multiprocessing
package) to allocate the work to multiple cores (8, say) on my machine,
but while I have enough RAM to keep one copy of the DataFrame in memory, I
can't afford to keep 8 copies in memory at once, so I somehow need to
share a single copy of the DataFrame across multiple processes (I don't
mind having read-only access from each process).

Has anyone addressed (or even better, solved) a problem like this?
Ideally, supposing the data were on 1,000 different stocks (it's not stock
data, but an example seems like a good idea), I'd do a groupby to split
the dataframe into individual stocks, and then somehow allocate the
processing of each group to a different processor, but again this would
somehow need to happen without creating 7 extra copies of the entire
dataframe.
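To make the question concrete, here's a minimal sketch of the naive version I have in mind (process_stock and the "ticker"/"price" columns are placeholders, not my real code):

```python
import pandas as pd
from multiprocessing import Pool

def process_stock(group):
    # Placeholder for the real estimation code run on one stock's data.
    return group["price"].mean()

def main(df):
    # Split the frame into one group per stock, then farm the groups
    # out to worker processes.
    groups = [g for _, g in df.groupby("ticker")]
    with Pool(8) as pool:
        return pool.map(process_stock, groups)
```

The catch, of course, is that Pool.map pickles each group into the workers, which is exactly the copying I'm trying to avoid.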

Thanks for any suggestions.

Richard Stanton

Jeff

Apr 5, 2013, 1:27:37 PM
to pyd...@googlegroups.com
Going to hopefully build something like this in 0.12
 
 
Depending on your needs, some of the following may be useful.
 
Sharing via multiprocessing almost always makes copies of numpy-type
data anyhow.
 
A scalable solution is to:
 
store in HDF
work on what you need by querying a part of the dataset
multiprocessing will work here, as HDFStores are multi-process readable
(BUT don't write from multiple processes!)
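A minimal sketch of that pattern (the file path, key name, and column names here are made up for illustration; requires PyTables). The frame is written once in queryable "table" format, and each worker re-opens the file read-only and queries only its own slice, so the full DataFrame never lives in more than one process:

```python
import pandas as pd
from multiprocessing import Pool

def build_store(path, df):
    # Write once, from a single process, in queryable "table" format,
    # indexing the "ticker" column so it can appear in where= clauses.
    df.to_hdf(path, key="prices", format="table", data_columns=["ticker"])

def fetch_mean(args):
    # Each worker reads only the rows for its ticker.
    path, ticker = args
    chunk = pd.read_hdf(path, key="prices", where="ticker == %r" % ticker)
    return ticker, float(chunk["price"].mean())

def main(path, tickers):
    with Pool(2) as pool:
        return dict(pool.map(fetch_mean, [(path, t) for t in tickers]))
```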
 
Jeff

Nathaniel Smith

Apr 8, 2013, 3:13:09 PM
to pyd...@googlegroups.com
What OS are you on? On POSIX (but not Windows), there's a cheap trick
that works great when you can get away with it: when you spawn a child
process via fork() (as multiprocessing does), the child process
ends up with what *looks* like a complete copy of the parent process's
memory -- but the operating system implements this using a sneaky
virtual memory hack, where the memory isn't actually copied until it's
written to (the term of art is "copy-on-write").

The easiest way to do this is to stash your data in a global variable,
then use multiprocessing to spawn some workers in the usual way -- but
have them access the data via the global, instead of via explicit
message passing. Note that (1) the data will be writeable in the
children, but (2) writes made in one child won't be visible in the
others, and (3) when you write to the data, your actual memory usage
will go up. So you should pretend it's read-only.
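Something like this sketch, say (the SHARED_DF name, worker logic, and column names are made up for illustration; the explicit "fork" context is what makes the trick work, so this won't fly on Windows):

```python
import multiprocessing as mp
import pandas as pd

# Set *before* forking; children inherit it copy-on-write on POSIX.
SHARED_DF = None

def worker(ticker):
    # Read-only access via the inherited global -- no pickling, no
    # explicit message passing, and (as long as we don't write to it)
    # no physical copy of the data.
    sub = SHARED_DF[SHARED_DF["ticker"] == ticker]
    return float(sub["price"].sum())

def run(df, tickers):
    global SHARED_DF
    SHARED_DF = df
    ctx = mp.get_context("fork")  # this trick needs fork(), not spawn
    with ctx.Pool(2) as pool:
        return pool.map(worker, tickers)
```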

One possible spanner in the works is that every time you access a
*Python* object, its refcount gets updated -- which is a write
operation. But this is only a big deal if you have many small Python
objects -- if you have one Python object that you use to access a
giant pile of memory, then it's no big deal. This might interact badly
with pandas's habit of using the 'object' dtype, though :-/

-n