How to work with categorical variables in pandas


ivan7707

Aug 20, 2015, 7:43:39 PM
to PyData
Hi, 

I have a column in my dataframe that has one of two values:  ' 36 months' and ' 60 months' 

I tried changing the column type so that I can run a logistic regression by doing this:

data['term'] = data['term'].astype('category')

But I get this error:

ValueError: could not convert string to float: '36 months'

When I do this:

data = data.replace([' 36 months'], '1')
data = data.replace([' 60 months'], '0')

Then the logistic regression works.  Since I have a lot of categorical variables, do I have to do this all manually, or is there a better way?

I recall in R, using 'factor' I was able to convert the column and run the regression without changing the values (it has been a while since I used it, so I could be remembering wrong).

Thanks,

Ivan

Joris Van den Bossche

Aug 21, 2015, 4:45:14 AM
to PyData
This works for me on 0.16.2:

In [51]: s = pd.Series(['36 months', '60 months', '36 months'])

In [52]: s.astype('category')
Out[52]:
0    36 months
1    60 months
2    36 months
dtype: category
Categories (2, object): [36 months, 60 months]

In [53]: pd.__version__
Out[53]: '0.16.2'


Can you provide a reproducible example that fails with that error?

Joris



ivan7707

Aug 21, 2015, 10:24:01 AM
to PyData
Hi Joris,  

 
Thanks for looking at this for me.

Here it is:

IN

from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.Series(['36 months', '60 months', '36 months'])
target = pd.Series([1,0,1])

target = pd.DataFrame({'dependent': pd.Series(target)})
data = pd.DataFrame({'independent': pd.Series(data)})


data['independent'] = data['independent'].astype('category')


In [37]:

data['independent']

Out[37]:
0    36 months
1    60 months
2    36 months
Name: independent, dtype: category
Categories (2, object): [36 months, 60 months]

In [36]:

model = LogisticRegression()
model.fit(data,target['dependent'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-bbbbfa54d9f6> in <module>()
      1 model = LogisticRegression()
----> 2 model.fit(data,target['dependent'])

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y)
   1015                              % self.C)
   1016 
-> 1017         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64, order="C")
   1018         self.classes_ = np.unique(y)
   1019         if self.solver not in ['liblinear', 'newton-cg', 'lbfgs']:

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
    442     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    443                     ensure_2d, allow_nd, ensure_min_samples,
--> 444                     ensure_min_features)
    445     if multi_output:
    446         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ivan7707

Aug 21, 2015, 10:26:22 AM
to PyData
Forgot to mention,  

When I do the following:

IN

data = data.replace([' 36 months'], '1')
data = data.replace([' 60 months'], '0')


Then the logistic regression works ok.  Do I have to change all of my categorical variables to numbers (e.g. 1 and 0) to get my logistic regression to work?

Thanks for any ideas,

Ivan

John E

Aug 21, 2015, 11:14:45 AM
to PyData
I haven't used sklearn, but you can probably get it to work by creating dummies with pd.get_dummies(df.column).
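
Something like this (a minimal sketch on a toy frame; the column name is from Ivan's example):

```python
import pandas as pd

df = pd.DataFrame({'term': ['36 months', '60 months', '36 months']})

# get_dummies expands the string column into one indicator (0/1) column
# per category level, which is the numeric input sklearn expects
dummies = pd.get_dummies(df['term'])
print(dummies)
```

You'd then pass `dummies` (or a frame with the original column replaced by these indicator columns) to the model's fit method.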

ivan7707

unread,
Aug 21, 2015, 11:48:30 AM8/21/15
to PyData
Hi John,

I came across get_dummies. I was hoping that changing the column to category would take care of that without me having to change or create new rows.

I recall in R, that all I had to do was change the column type to factor, and run the logistic regression.  I was wondering if there was the same type of thing here.  

Ivan

John E

Aug 21, 2015, 12:03:57 PM
to PyData
Well, categorical is relatively new in pandas and still getting integrated throughout the ecosystem.  You might have better luck asking this question on Stack Overflow, or maybe try the statsmodels group (https://groups.google.com/forum/#!forum/pystatsmodels), although statsmodels may not handle this the same way; I think they use patsy for this sort of thing (?)

Jeff Reback

Aug 21, 2015, 12:22:24 PM
to pyd...@googlegroups.com
latest version of patsy does handle auto conversions 
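
For instance, something along these lines (a sketch reusing Ivan's toy columns; patsy treatment-codes the string column on its own):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'dependent': [1, 0, 1],
                   'independent': ['36 months', '60 months', '36 months']})

# patsy detects that 'independent' is non-numeric and dummy-codes it,
# dropping the first level as the reference category
y, X = dmatrices('dependent ~ independent', df, return_type='dataframe')
print(X.columns.tolist())
```

The resulting design matrix `X` is fully numeric and can be handed to a model directly.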

josef...@gmail.com

Aug 21, 2015, 12:40:44 PM
to pyd...@googlegroups.com
On Fri, Aug 21, 2015 at 12:22 PM, Jeff Reback <jeffr...@gmail.com> wrote:
latest version of patsy does handle auto conversions 

This might or might not work with statsmodels in the case of dependent categorical variables, e.g. Logit.
AFAIR, nobody checked and added unit tests, and statsmodels partially disagrees with patsy about conversion of dependent variables.

There should be no problem with categorical explanatory variables because patsy handles those in a consistent way for statsmodels.

Josef

ivan7707

Aug 21, 2015, 3:41:24 PM
to PyData
Thanks for the info.

I'll take a look at patsy and statsmodels.  

All the best,

Ivan 

Patricio Del Boca

Aug 22, 2015, 7:40:19 PM
to pyd...@googlegroups.com
AFAIK, you can't work with categorical variables in the same way you do in R. Scikit-learn does not support pandas DataFrames with categorical features: in order to use categorical features in scikit-learn models, you need to preprocess them first. There are several techniques, but the most common are one-hot encoding and the get_dummies method from pandas. Scikit-learn models only support numerical variables.

A good starting point for those who came from R is to read: http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

Here is a discussion in github about this topic: https://github.com/scikit-learn/scikit-learn/issues/4865
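
As a quick illustration of the pandas side (a sketch on Ivan's toy column): for a two-level feature, the integer codes of a categorical do the same job as the manual replace, though for multi-level nominal features one-hot encoding / get_dummies is usually the safer choice, since codes impose an arbitrary ordering.

```python
import pandas as pd

df = pd.DataFrame({'term': ['36 months', '60 months', '36 months']})

# convert to category, then take the integer code (0, 1, ...) per level;
# levels are sorted, so '36 months' -> 0 and '60 months' -> 1
df['term'] = df['term'].astype('category')
df['term_code'] = df['term'].cat.codes

print(df['term_code'].tolist())  # → [0, 1, 0]
```

The resulting numeric column can be fed to a scikit-learn model.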

Patricio Del Boca
Ingeniero en Sistemas de Información - CSM

ivan7707

Aug 24, 2015, 9:27:44 AM
to PyData
Thanks Patricio. 

I couldn't find any definite info about whether scikit-learn would support non-numerical variables.  

I'll check out the links. 

Ivan 