PatsyError: Error converting data to categorical


Sam

Nov 16, 2016, 7:32:17 AM
to pystatsmodels
Hi all,

New to this group, so don't hesitate to let me know if my question isn't very clear. I'm a new user of statsmodels and I'm loving it, but I've run into the issue described in this StackOverflow question:

http://stackoverflow.com/questions/34035912/patsy-new-levels-in-categorical-fields-in-test-data

Basically, when doing cross-validation for an OLS model, some categories may not be present in the subset of the data used to fit the model. The following error is then raised when trying to predict the outcome for an input that contains unexpected categories:

```
/anaconda/envs/house/lib/python3.5/site-packages/patsy/categorical.py in categorical_to_int(data, levels, NA_action, origin)
    360                                  "observation with value %r does not match "
    361                                  "any of the expected levels (expected: %s)"
--> 362                                  % (value, level_str), origin)
    363             except TypeError:
    364                 raise PatsyError("Error converting data to categorical: "

PatsyError: Error converting data to categorical: observation with value 'Metal' does not match any of the expected levels (expected: ['ClyTile', 'CompShg', ..., 'WdShake', 'WdShngl'])

```
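For context, a minimal sketch that reproduces the situation (the toy DataFrame and column names below are made up for illustration, not the actual data):

```
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: the 'Metal' roof type only appears in the held-out row.
df = pd.DataFrame({
    "price": [100, 120, 90, 150, 130],
    "roof":  ["CompShg", "CompShg", "WdShake", "WdShngl", "Metal"],
})

train = df[df["roof"] != "Metal"]
test = df[df["roof"] == "Metal"]

model = smf.ols("price ~ C(roof)", data=train).fit()
model.predict(test)  # raises the PatsyError above: 'Metal' is not an expected level
```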

I guess one solution would be to drop all the observations that have values not present in the model. Is there a better way of dealing with this?

Thanks!

Sam

josef...@gmail.com

Nov 16, 2016, 8:32:58 AM
to pystatsmodels
I don't know what is usually done for standard cross-validation in this case. I would select only training samples that contain all levels of the categorical variables.


A simple way to work with patsy, AFAICS from your example, is to apply the formula to the entire dataset before splitting. I think this needs to be done whenever there are stateful transformations or categorical variables, so that the design matrix that is created is consistent across subsamples.
After that, there is still the problem that the training sample may not have full rank in all k-fold subsamples. OLS handles this by default using the generalized inverse (pinv), which produces a somewhat regularized estimate for coefficients that are collinear or not identified by the data.
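A rough sketch of what this could look like (the toy data and variable names are illustrative, not from the thread): build the design matrices once from the full data with patsy.dmatrices, then split the rows for each fold.

```
import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

# Toy data, for illustration only.
rng = np.random.RandomState(0)
roof = rng.choice(["CompShg", "WdShake", "WdShngl", "Metal"], size=40)
df = pd.DataFrame({"roof": roof, "price": 100 + rng.normal(size=40)})

# Build y and X once from the full dataset, so every fold shares the same
# columns and unseen levels cannot appear at predict time.
y, X = patsy.dmatrices("price ~ C(roof)", data=df, return_type="dataframe")

k = 5
folds = np.array_split(rng.permutation(len(X)), k)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    # OLS.fit() uses the generalized (Moore-Penrose) inverse by default, so a
    # training fold whose design matrix is rank deficient still returns an
    # estimate for the collinear or unidentified coefficients.
    res = sm.OLS(y.iloc[train_idx], X.iloc[train_idx]).fit()
    preds = res.predict(X.iloc[test_idx])
```

The splitting here is done by hand with numpy; any k-fold splitter would work the same way on the pre-built matrices.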

Josef

 


Sam

Nov 20, 2016, 12:22:55 PM
to pystatsmodels

Many thanks, Josef! I tried using design matrices as you suggested and it worked perfectly!

Sam

 

