Share pandas dataframe or numpy array between workers


venkatesh...@gmail.com

Jun 11, 2015, 5:56:11 AM
to luigi...@googlegroups.com
Hello,
How do I share a Pandas DataFrame or a NumPy array between workers running in parallel? I tried to pass them as parameters, but that does not work. Writing to csv and each worker calling read_csv() is not an option because of the size of the dataset. More generally, how do I pass arbitrary Python objects?

Thanks,
Venkatesh Halli.

Arash Rouhani

Jun 11, 2015, 6:00:31 AM
to venkatesh...@gmail.com, luigi...@googlegroups.com
In a map reduce setting, isn't that just your input dataset? I mean the lines that will get passed into your map function.


--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

venkatesh...@gmail.com

Jun 11, 2015, 9:59:42 AM
to luigi...@googlegroups.com, venkatesh...@gmail.com
This is for a recommendation system. I have a cosine similarity matrix generated by a task and stored in RAM. In the next task, each worker would pick up a row and compare it against the other rows. For that, every worker needs read access to the entire matrix. I've been unable to pass a reference to this matrix via parameters.

felipe...@gmail.com

Jul 31, 2015, 5:06:49 PM
to Luigi, venkatesh...@gmail.com

There is no standard way of doing this, because the workers are separate processes.

That said, there are a few options you might try, such as using a shared in-memory store like Redis to keep those objects available to all workers.
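A minimal sketch of the Redis approach, assuming a Redis server on localhost and the `redis` Python client; the key name `similarity_matrix` and the helper names are made up for illustration:

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialize an array with np.save so dtype and shape survive the round trip."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def bytes_to_array(data):
    """Inverse of array_to_bytes."""
    return np.load(io.BytesIO(data))


def publish_matrix(arr, key="similarity_matrix"):
    # The producer task stores the matrix once; assumes `pip install redis`
    # and a server reachable on localhost:6379.
    import redis
    redis.Redis().set(key, array_to_bytes(arr))


def fetch_matrix(key="similarity_matrix"):
    # Each worker process reads the same serialized matrix back.
    import redis
    return bytes_to_array(redis.Redis().get(key))
```

Each worker would then fetch the full matrix once at the start of its run() and index into it, instead of receiving the object through a Luigi parameter.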

Ron Reiter

Aug 1, 2015, 1:39:43 AM
to felipe...@gmail.com, Luigi, venkatesh...@gmail.com
Passing arbitrary Python objects doesn't make sense Luigi-wise, because it would mean that running just that task requires handing it a live Python object, which is impossible.

What you want is to pickle the Python and NumPy objects and pass the locations of those pickled (serialized) files as parameters. NumPy has its own serialization format: http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html



- Ron



