How to work with categorical variables in pandas


ivan7707

Aug 20, 2015, 7:43:39 PM
to PyData
Hi, 

I have a column in my dataframe that has one of two values:  ' 36 months' and ' 60 months' 

I tried changing the column type so that I can run a logistic regression by doing this:

data['term'] = data['term'].astype('category')

But I get this error:

ValueError: could not convert string to float: '36 months'

When I do this:

data = data.replace([' 36 months'], '1')
data = data.replace([' 60 months'], '0')

Then the logistic regression works.  Since I have a lot of categorical variables, do I have to do this all manually, or is there a better way?

I recall in R, using 'factor' I was able to convert the column and run the regression without changing the values (it has been a while since I used it, so I could be remembering wrong).

Thanks,

Ivan

Joris Van den Bossche

Aug 21, 2015, 4:45:14 AM
to PyData
This works for me on 0.16.2:

In [51]: s = pd.Series(['36 months', '60 months', '36 months'])

In [52]: s.astype('category')
Out[52]:
0    36 months
1    60 months
2    36 months
dtype: category
Categories (2, object): [36 months, 60 months]

In [53]: pd.__version__
Out[53]: '0.16.2'


Can you provide a reproducible example that fails with that error?

Joris



ivan7707

Aug 21, 2015, 10:24:01 AM
to PyData
Hi Joris,  

 
Thanks for looking at this for me.

Here it is:

IN

from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.Series(['36 months', '60 months', '36 months'])
target = pd.Series([1,0,1])

target = pd.DataFrame({'dependent': pd.Series(target)})
data = pd.DataFrame({'independent': pd.Series(data)})


data['independent'] = data['independent'].astype('category')


In [37]:

data['independent']

Out[37]:
0    36 months
1    60 months
2    36 months
Name: independent, dtype: category
Categories (2, object): [36 months, 60 months]

In [36]:

model = LogisticRegression()
model.fit(data,target['dependent'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-bbbbfa54d9f6> in <module>()
      1 model = LogisticRegression()
----> 2 model.fit(data,target['dependent'])

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y)
   1015                              % self.C)
   1016 
-> 1017         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64, order="C")
   1018         self.classes_ = np.unique(y)
   1019         if self.solver not in ['liblinear', 'newton-cg', 'lbfgs']:

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
    442     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    443                     ensure_2d, allow_nd, ensure_min_samples,
--> 444                     ensure_min_features)
    445     if multi_output:
    446         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Users\MGR17907\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ivan7707

Aug 21, 2015, 10:26:22 AM
to PyData
Forgot to mention,  

When I do the following:

IN

data = data.replace([' 36 months'], '1')
data = data.replace([' 60 months'], '0')


Then the logistic regression works ok.  Do I have to change all of my categorical variables to numbers (e.g. 1 and 0) to get my logistic regression to work?

Thanks for any ideas,

Ivan

John E

Aug 21, 2015, 11:14:45 AM
to PyData
I haven't used sklearn, but you can probably get it to work by creating dummies with pd.get_dummies(df.column).
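
Something like this (a minimal sketch on a toy frame; the column name is from Ivan's example):

```python
import pandas as pd

df = pd.DataFrame({'term': ['36 months', '60 months', '36 months']})

# get_dummies expands the string column into one indicator (0/1) column
# per category level, which is the numeric input sklearn expects
dummies = pd.get_dummies(df['term'])
print(dummies)
```

You'd then pass `dummies` (or a frame with the original column replaced by these indicator columns) to the model's fit method.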

ivan7707

unread,
Aug 21, 2015, 11:48:30 AM8/21/15
to PyData
Hi John,

I came across get_dummies. I was hoping that changing the column to category would take care of that without me having to change or create new rows.

I recall in R, that all I had to do was change the column type to factor, and run the logistic regression.  I was wondering if there was the same type of thing here.  

Ivan

John E

Aug 21, 2015, 12:03:57 PM
to PyData
Well, categorical is relatively new in pandas and still getting integrated throughout the ecosystem.  You might have better luck asking this question on Stack Overflow, or maybe try the statsmodels group (https://groups.google.com/forum/#!forum/pystatsmodels), although statsmodels may not handle this the same way; I think they use patsy for this sort of thing (?)

Jeff Reback

Aug 21, 2015, 12:22:24 PM
to pyd...@googlegroups.com
latest version of patsy does handle auto conversions 
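
For instance, something along these lines (a sketch reusing Ivan's toy columns; patsy treatment-codes the string column on its own):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'dependent': [1, 0, 1],
                   'independent': ['36 months', '60 months', '36 months']})

# patsy detects that 'independent' is non-numeric and dummy-codes it,
# dropping the first level as the reference category
y, X = dmatrices('dependent ~ independent', df, return_type='dataframe')
print(X.columns.tolist())
```

The resulting design matrix `X` is fully numeric and can be handed to a model directly.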

josef...@gmail.com

Aug 21, 2015, 12:40:44 PM
to pyd...@googlegroups.com
On Fri, Aug 21, 2015 at 12:22 PM, Jeff Reback <jeffr...@gmail.com> wrote:
latest version of patsy does handle auto conversions 

This might or might not work with statsmodels in the case of dependent categorical variables, e.g. Logit.
AFAIR, nobody checked and added unit tests, and statsmodels partially disagrees with patsy about conversion of dependent variables.

There should be no problem with categorical explanatory variables because patsy handles those in a consistent way for statsmodels.

Josef

ivan7707

Aug 21, 2015, 3:41:24 PM
to PyData
Thanks for the info.

I'll take a look at patsy and statsmodels.  

All the best,

Ivan 

Patricio Del Boca

Aug 22, 2015, 7:40:19 PM
to pyd...@googlegroups.com
AFAIK, you can't work with categorical variables in the same way you do in R. Scikit-learn does not support pandas DataFrames with categorical features: in order to use categorical features in scikit-learn models, you need to preprocess them first. There are several techniques, but the most common are one-hot encoding and the get_dummies method from pandas. Scikit-learn models only support numerical variables.

A good starting point for those who came from R is to read: http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

Here is a discussion in github about this topic: https://github.com/scikit-learn/scikit-learn/issues/4865
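
As a quick illustration of the pandas side (a sketch on Ivan's toy column): for a two-level feature, the integer codes of a categorical do the same job as the manual replace, though for multi-level nominal features one-hot encoding / get_dummies is usually the safer choice, since codes impose an arbitrary ordering.

```python
import pandas as pd

df = pd.DataFrame({'term': ['36 months', '60 months', '36 months']})

# convert to category, then take the integer code (0, 1, ...) per level;
# levels are sorted, so '36 months' -> 0 and '60 months' -> 1
df['term'] = df['term'].astype('category')
df['term_code'] = df['term'].cat.codes

print(df['term_code'].tolist())  # → [0, 1, 0]
```

The resulting numeric column can be fed to a scikit-learn model.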

Patricio Del Boca
Ingeniero en Sistemas de Información - CSM

ivan7707

Aug 24, 2015, 9:27:44 AM
to PyData
Thanks Patricio. 

I couldn't find any definite info about whether scikit-learn would support non-numerical variables.  

I'll check out the links. 

Ivan 