Pickling of patsy DesignMatrixBuilder instances

Christian Hudon

Jan 13, 2014, 3:44:57 PM
to pyd...@googlegroups.com
Hi again,

I'm trying to use patsy to convert a pandas DataFrame into a matrix that can be handed off to a neural network. The training of said neural network is done in a separate process (batch) from the prediction requests. So I need to save the DesignMatrixBuilder instances that I get from applying the patsy formula to the training set, so they can be applied to predict requests. However, when I try to pickle the DesignMatrixBuilder instances, I get a PicklingError.

Here is an example that shows the problem:

----cut here----

from patsy import dmatrices, demo_data
import pickle
import pandas as pd

dataset = pd.DataFrame(demo_data('a', 'b', 'c', 'x', 'y1', 'y2'))
formula = "y1 ~ x"

target_matrix, input_matrix = dmatrices(formula, dataset)
pickle.dumps(input_matrix.design_info.builder)

----cut here----

My code is more involved and fails with a different pickling error (trying to pickle the random module), but I checked the stack trace and it's failing while trying to pickle a DesignMatrixBuilder instance. Maybe the EvalFactor causes the pickle module to try to pickle the world? Anyway, will it be possible to pickle DesignMatrixBuilder instances, or should I give up and use something else instead? (I'd really like to be able to use R-like formulas to configure my models, though.)

Thanks again,

  Christian

PS Thank you for the pointer to the Q() function. Maybe it could be mentioned a bit more prominently in the docs? It was easy to miss...

dartdog

Jan 13, 2014, 4:37:40 PM
to pyd...@googlegroups.com
Wild guess: a number of people have had issues when using the --pylab option in IPython Notebook. If you are using it, don't.

Christian Hudon

Jan 14, 2014, 10:40:23 AM
to pyd...@googlegroups.com
I wish the problem was that simple (I was using IPython Notebook), but it isn't. Running the sample I supplied in a plain Python interpreter also produces an error: "TypeError: can't pickle module objects."

Is anyone pickling DesignMatrixBuilder instances successfully?

  Christian

Christian Hudon

Jan 15, 2014, 11:42:29 AM
to pyd...@googlegroups.com
I just want to add that I also tried with a modified version of my example that doesn't use pandas, and the exception is still raised, so the problem doesn't involve pandas.

Thanks,

  Christian

Skipper Seabold

Jan 15, 2014, 11:57:38 AM
to pyd...@googlegroups.com
On Mon, Jan 13, 2014 at 3:44 PM, Christian Hudon <chr...@pianocktail.org> wrote:
> Hi again,
>
> I'm trying to use patsy to convert a pandas DataFrame into a matrix that can
> be handed off to a neural network. The training of said neural network is
> done in a separate process (batch) from the prediction requests. So I need
> to save the DesignMatrixBuilder instances that I get from applying the patsy
> formula to the training set, so they can be applied to predict requests.
> However, when I try to pickle the DesignMatrixBuilder instances, I get a
> PicklingError.
>
> Here is an example that shows the problem:
>
> ----cut here----
>
> from patsy import dmatrices, demo_data
> import pickle
> import pandas as pd
>
> dataset = pd.DataFrame(demo_data('a', 'b', 'c', 'x', 'y1', 'y2'))
> formula = "y1 ~ x"
>
> target_matrix, input_matrix = dmatrices(formula, dataset)
> pickle.dumps(input_matrix.design_info.builder)
>

A few quick thoughts.

1) Why are you using formulas? If you're not taking advantage of
interactions, automatic categorical handling, etc. you don't need
them.
2) Why not just pickle the DataFrame and the formula string and have
your code call dmatrices after you load them?
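
A minimal sketch of what suggestion 2 might look like, using patsy's demo data without pandas. The dictionary keys `formula` and `data` are illustrative choices, not a patsy convention. Note that calling dmatrices again re-derives categorical levels and stateful-transform state from whatever data it is handed, so this only reproduces the original design matrix when the same data is reloaded.

```python
import pickle

from patsy import dmatrices, demo_data

# Illustrative data: demo_data returns a plain dict of columns.
data = demo_data("a", "b", "x", "y1")
formula = "y1 ~ x"

# Pickle the ingredients (formula string + raw data), not the builder.
blob = pickle.dumps({"formula": formula, "data": data})

# Later, possibly in another process: reload and rebuild the matrices.
restored = pickle.loads(blob)
design_y, design_X = dmatrices(restored["formula"], restored["data"])

print(design_X.design_info.column_names)
```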

Skipper

Nathaniel Smith

Jan 15, 2014, 12:02:19 PM
to pyd...@googlegroups.com
On Tue, Jan 14, 2014 at 3:40 PM, Christian Hudon <chr...@pianocktail.org> wrote:
> I wish the problem was that simple (I was using IPython Notebook), but it
> isn't. Running the sample I supplied in a plain Python interpreter also
> produces an error: "TypeError: can't pickle module objects."
>
> Is anyone pickling DesignMatrixBuilder instances successfully?

Yes, unfortunately at the moment DesignMatrixBuilders cannot be
pickled. This would obviously be a great thing to have, but it's not
trivial, both because it needs some thought to make sure that future
versions (which might have different internals) will be able to load
pickles created with earlier versions, and because it's not clear how
to pickle an execution environment. (If you want to interpret a
formula like "y ~ np.log(x)", it's not enough to know the formula
string; you also need to know that "np" needs to be loaded and
available.)
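
The execution-environment difficulty can be illustrated with the standard library alone. In this sketch `math` stands in for `np`, and the helpers `save_env_spec` and `load_env_spec` are hypothetical names, not part of patsy. It shows that a namespace holding a module cannot be pickled directly, and one possible workaround under the assumption that the loading program can import the same modules: store module names and re-import them on load.

```python
import importlib
import math
import pickle

# A namespace like the one a formula such as "y ~ math.log(x)" would need.
env = {"math": math}

# Pickling it fails because module objects are not picklable.
try:
    pickle.dumps(env)
    module_pickle_error = None
except TypeError as exc:
    module_pickle_error = exc


def save_env_spec(formula, module_aliases):
    """Pickle the formula plus an alias -> module-name mapping (hypothetical helper)."""
    return pickle.dumps({"formula": formula, "modules": module_aliases})


def load_env_spec(blob):
    """Re-import the named modules to rebuild the namespace (hypothetical helper)."""
    spec = pickle.loads(blob)
    rebuilt = {
        alias: importlib.import_module(name)
        for alias, name in spec["modules"].items()
    }
    return spec["formula"], rebuilt


blob = save_env_spec("y ~ math.log(x)", {"math": "math"})
formula, rebuilt_env = load_env_spec(blob)
```

This only covers names that refer to importable modules; variables and functions defined interactively would need separate handling.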

-n

josef...@gmail.com

Jan 15, 2014, 12:17:05 PM
to pyd...@googlegroups.com
Related: models in statsmodels that use (pandas and) formulas cannot be pickled.
The unit tests only cover models based on numpy arrays.
I don't know whether models that use a DataFrame but no formula can be pickled.

Josef


Christian Hudon

Apr 15, 2014, 2:27:46 PM
to pyd...@googlegroups.com


On Wednesday, January 15, 2014 at 11:57:38 UTC-5, Skipper Seabold wrote:
> A few quick thoughts.
>
> 1) Why are you using formulas? If you're not taking advantage of
> interactions, automatic categorical handling, etc. you don't need
> them.
> 2) Why not just pickle the DataFrame and the formula string and have
> your code call dmatrices after you load them?

Sorry for the late reply.

For 1), I *am* taking advantage of automatic categorical handling, stateful transforms, interactions (sometimes), etc. The vast majority of features that are useful in creating a design matrix for statistics are useful in data mining too. Most of the datasets I've worked with have had some form of categorical data, so patsy's handling of that in the formula is really useful. And in general, being able to change the representation of the dataset I'm feeding to machine learning algorithms by tweaking a formula instead of a bunch of code is really useful. The only feature of patsy that is not quite as useful in a data mining context is avoiding redundancies in the design matrix.

For 2), the DataFrame is very big, and the design matrix will be needed by multiple processes at the same time (when exploring the hyperparameter space). So I'm saving the design matrix into a .npy file that can then be memory-mapped by all these processes (so the design matrix isn't sitting in memory multiple times). So your suggestion (although appreciated) wouldn't work for this.
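
For readers unfamiliar with this setup, here is a rough sketch of the .npy approach. The array contents and the file name `design_matrix.npy` are placeholders, not my actual data.

```python
import os
import tempfile

import numpy as np

# Placeholder for a real design matrix (e.g. np.asarray of patsy's output).
design_matrix = np.arange(12, dtype=np.float64).reshape(4, 3)

# Write it once to disk in .npy format.
path = os.path.join(tempfile.mkdtemp(), "design_matrix.npy")
np.save(path, design_matrix)

# Each worker process opens the file read-only as a memory map, so the
# operating system can share the pages instead of every process holding
# its own copy.
shared = np.load(path, mmap_mode="r")

print(type(shared).__name__, shared.shape)
```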

Thanks,

  Christian

Christian Hudon

Apr 15, 2014, 2:50:23 PM
to pyd...@googlegroups.com, n...@pobox.com


On Wednesday, January 15, 2014 at 12:02:19 UTC-5, Nathaniel Smith wrote:

> Yes, unfortunately at the moment DesignMatrixBuilders cannot be
> pickled. This would obviously be a great thing to have, but it's not
> trivial, both because it needs some thought to make sure that future
> versions (which might have different internals) will be able to load
> pickles created with earlier versions, and because it's not clear how
> to pickle an execution environment. (If you want to interpret a
> formula like "y ~ np.log(x)", it's not enough to know the formula
> string; you also need to know that "np" needs to be loaded and
> available.)

Meeting both of those criteria (compatibility with future versions and pickling the whole execution environment) would be nice, but I think there are a bunch of very useful use cases where they are not necessary. I feel that the second one is only necessary when loading pickled DesignMatrixBuilders from totally unrelated programs. When the pickles are used in related programs, it's no problem at all to make sure that whatever environment is required to interpret the formulas is available in the programs that load the pickles. And a good fraction of the time, the required environment will be only patsy + numpy anyway.

As for the "compatibility with future versions" requirement, it would be totally fine for my use case to have to redo the pickle when the version of patsy changes. (Other use cases, like multiprocessing, care even less about that.)
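
To make the "redo the pickle when the version changes" idea concrete, here is a rough sketch. The names `dump_versioned`, `load_versioned`, and `VersionMismatchError` are hypothetical, and the version string is passed in by the caller (it could be something like `patsy.__version__`); nothing here is an existing patsy API.

```python
import pickle


class VersionMismatchError(RuntimeError):
    """Raised when a pickle was written by a different library version."""


def dump_versioned(obj, library_version):
    # Store the version string next to the payload.
    return pickle.dumps({"version": library_version, "payload": obj})


def load_versioned(blob, library_version):
    record = pickle.loads(blob)
    if record["version"] != library_version:
        # Refuse to load instead of guessing at compatibility;
        # the caller is expected to regenerate the pickle.
        raise VersionMismatchError(
            "pickle written with version %r, running version %r"
            % (record["version"], library_version)
        )
    return record["payload"]


blob = dump_versioned({"formula": "y1 ~ x"}, "0.2.1")
same_version = load_versioned(blob, "0.2.1")

try:
    load_versioned(blob, "0.3.0")
    mismatch_detected = False
except VersionMismatchError:
    mismatch_detected = True
```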

In short, versions that don't reach the end goal of "compatibility with future versions and pickling the whole execution environment" would be very useful too, and I hope you'd be willing to consider getting to that end goal in multiple steps, instead of waiting until both desiderata are satisfied before adding pickling support.

Thanks,

  Christian

Skipper Seabold

Apr 15, 2014, 3:10:01 PM
to pyd...@googlegroups.com
One possibility?

https://pypi.python.org/pypi/dill

Skipper