Categorical regression using multiple categoricals from the same list

Alex

unread,
Mar 13, 2015, 9:53:16 AM3/13/15
to pystat...@googlegroups.com

I have the following problem:

I have two lists of data: (var1, var2). However, var1 is categorical information. I want to use var1 and a regression model to predict what var2 would be. My actual dataset is much larger, and the categorical info listed below doesn't make much sense in this snippet. What I have right now:
 
import numpy as np   
import pandas as pd
import statsmodels.formula.api as smf
 
#Creating dataset    
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=15, freq='H'))    
df.loc[:,'var1'] = [6.2,7.3,4.4,2.1,6.8,7.5,8.9,9.4,3.6,4.2,5.8,9.2,9.0,2.2,3.4]    
df.loc[:,'var2'] = df.loc[:,'var1'] + np.random.uniform(-2.1,1.4,15)    
df.loc[:,'cat'] = [3,3,2,1,3,1,0,0,2,2,3,0,3,2,3]    
 
#Performing regression
formula = 'var1 ~ C(cat):var2 + C(cat):I(var2**2)'    
model = smf.quantreg(formula=formula, data=df)    
res = model.fit(q=0.5)    
df.loc[:,'prediction'] = res.predict(df)    
print(res.summary())
 

What I want is to perform the regression in a similar manner, except I want each categorical level to also include the category above and below it, e.g. I want to regress for categories (0,1,2), (1,2,3), (2,3,0) and (3,0,1) instead of 0, 1, 2, 3. The reason for doing this is a lack of data points in the individual categories.

Is there a clean way of doing this in Python?

josef...@gmail.com

unread,
Mar 13, 2015, 10:07:05 AM3/13/15
to pystatsmodels

I was hoping somebody would answer who understands the encodings. I never tried to figure out the details.


To clarify, so I understand this:
" I want to regress for categories (0,1,2),(1,2,3),(2,3,0) and (3,0,1) instead of 0,1,2,3"

Does this mean that the first column/dummy should have ones if `cat` in {0, 1, 2} and zero if cat == 3,
and analogously for the others?

In that case it would be just `1 - dmatrix('C(cat) - 1')`, mixing numpy and patsy,
i.e. just negate/reverse every dummy variable.
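A minimal sketch of that negation (an illustration, assuming patsy is available; `cat` is the column from the snippet above):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# The category column from the original snippet.
df = pd.DataFrame({'cat': [3, 3, 2, 1, 3, 1, 0, 0, 2, 2, 3, 0, 3, 2, 3]})

# Full dummy coding: one column per level, no intercept.
dummies = np.asarray(dmatrix('C(cat) - 1', df))

# Negate every dummy: column j is now 1 whenever cat != j,
# i.e. an "all levels except j" indicator.
negated = 1 - dummies
```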

I don't know what a pure patsy formula way for creating this would be.

Or do you want something else?

Josef

Alex

unread,
Mar 13, 2015, 10:40:30 AM3/13/15
to pystat...@googlegroups.com
I'm sorry if it's a little vague, I'll try to clarify. In my current dataset the number of data points in each category isn't always large enough to perform a proper regression (due to outliers). Unlike this snippet, my real dataset consists of 16 categories. I want to create a different regression line/model for each category (different params for each category). In order to have enough data points for, say, category 7, I want to perform the regression using the data points from categories 6, 7 and 8 (the categories above, below and including category 7).
 
Are you saying I should do the following (on my example dataset)?
 
a = np.array([[ 1., 1., 1., 0.],
              [ 0., 1., 1., 1.],
              [ 0., 0., 1., 1.],
              [ 1., 0., 0., 1.]])

formula = 'var1 ~ C(cat,a):var2 + C(cat,a):I(var2**2)'

 

On Friday, March 13, 2015 at 3:07:05 PM UTC+1, josefpktd wrote:

josef...@gmail.com

unread,
Mar 13, 2015, 10:55:35 AM3/13/15
to pystatsmodels
On Fri, Mar 13, 2015 at 10:40 AM, Alex <alexvanc...@gmail.com> wrote:
> I'm sorry if it's a little vague, I'll try to clarify. In my current dataset
> the number of data points in each category isn't always large enough to
> perform a proper regression (due to outliers). Unlike this snippet, my real
> dataset consists of 16 categories. I want to create a different regression
> line/model for each category (different params for each category). In order
> to have enough data points for, say, category 7, I want to perform the
> regression using the data points from categories 6, 7 and 8 (the categories
> above, below and including category 7).
>
> Are you saying I should do the following (on my example dataset)?
>
> a = np.array([[ 1., 1., 1., 0.],
>               [ 0., 1., 1., 1.],
>               [ 0., 0., 1., 1.],
>               [ 1., 0., 0., 1.]])
>
> formula = 'var1 ~ C(cat,a):var2 + C(cat,a):I(var2**2)'



I'm not saying that, since I don't know. But if it works, then it's fine: http://patsy.readthedocs.org/en/latest/categorical-coding.html


However, if you need to use this more often, what I would do is use pandas to define a new categorical variable for the merged cells, e.g. a string variable with new level names.
I think pandas lets you use a dictionary to define the mapping; I never used that either, but I have seen examples somewhere.
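A sketch of that pandas route; the dictionary, the merged level names, and the particular merge below are made up for illustration:

```python
import pandas as pd

# The category column from the original snippet.
df = pd.DataFrame({'cat': [3, 3, 2, 1, 3, 1, 0, 0, 2, 2, 3, 0, 3, 2, 3]})

# Dictionary mapping old levels -> merged level names (illustrative merge).
mapping = {0: 'low', 1: 'low', 2: 'high', 3: 'high'}
df['cat_merged'] = df['cat'].map(mapping)
```

The merged column can then be used directly in a formula, e.g. 'var1 ~ C(cat_merged):var2'.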

Josef

Nathaniel Smith

unread,
Mar 13, 2015, 4:00:50 PM3/13/15
to pystatsmodels
[sigh]

On Fri, Mar 13, 2015 at 12:58 PM, Nathaniel Smith <n...@pobox.com> wrote:
> The first problem you have to solve is that technically what you want
> is not, mathematically, regression :-). The issue is that regression
> is basically about solving a credit assignment problem: I have a bunch
> of predictors that might be affecting this single measurement; which
> predictors are responsible for how much? So if you take what would
> seem like the simplest approach, of letting a single measurement for
> category 2 affect your estimation of the (0,1,2) parameter and the
> (1,2,3) parameter at the same time, then the regression model will try
> to uniquely assign the variance in your measurement to *one* of these
> parameters.
>
> (Another way to think about this: you are trying to stretch your data
> further by replacing 4 degrees of freedom with... 4 degrees of
> freedom. If you don't have enough data to estimate 4 quantities, then
> switching around which 4 quantities you're trying to estimate probably
> won't help much.)
>
> There are two ways forward that I see:
>
> The "get a bigger hammer" approach: split up your data into
> overlapping bins "by hand", and do a separate regression like "var1 ~
> var2 + I(var2**2)" on each bin.
>
> The "endorsed by regression textbooks" approach: use something like a
> spline basis to parametrize the effect of your categories. The idea
> here is that you fit a smooth function of your categories, so that
> there's only one value f(2) that is taken into account when trying to
> predict a measurement made at category 2, but b/c the function is
> constrained to be low-dimensional and smooth that value has to be
> close to f(1) and f(3), so data measured on categories 1 and 3 will
> also affect your estimates for 2. Something like e.g.
>
> var1 ~ cc(cat, df=2):(var2 + I(var2**2))
>
> for a 2 degree-of-freedom cyclic spline basis (cyclic b/c it sounds
> like you want to treat category 3 as being next to category 0?).
>
> -n
>
> --
> Nathaniel J. Smith -- http://vorpus.org



--
Nathaniel J. Smith -- http://vorpus.org
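The "bigger hammer" approach above can be sketched as follows. This is an illustration, not a definitive implementation: the data are simulated, the column names follow the thread's example, and the wrap-around binning assumes the categories are cyclic, as suggested:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the real data.
rng = np.random.default_rng(0)
n, n_cat = 200, 4
df = pd.DataFrame({'cat': rng.integers(0, n_cat, n),
                   'var2': rng.uniform(0, 10, n)})
df['var1'] = 2.0 + 0.5 * df['var2'] + rng.normal(0, 1.0, n)

# One separate regression per overlapping bin (c-1, c, c+1), wrapping around.
params_per_cat = {}
for c in range(n_cat):
    members = {(c - 1) % n_cat, c, (c + 1) % n_cat}
    sub = df[df['cat'].isin(members)]
    res = smf.quantreg('var1 ~ var2 + I(var2**2)', data=sub).fit(q=0.5)
    params_per_cat[c] = res.params
```

Each entry of `params_per_cat` then holds the median-regression parameters estimated from the three adjacent categories.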

josef...@gmail.com

unread,
Mar 13, 2015, 4:15:50 PM3/13/15
to pystatsmodels
I'm not sure what the problem is, so a simpler question:

Nathaniel: What's the patsy way of merging levels?

For example, I have ethnicity: white, black, asian, latino and others (substitute official names of ethnicity).
I want to keep white and black and merge all the others, since there are too few observations in the other levels.

I'd like to get the usual categorical encoding, or the full dummy coding.

Josef

Nathaniel Smith

unread,
Mar 13, 2015, 4:22:43 PM3/13/15
to pystatsmodels
On Fri, Mar 13, 2015 at 1:22 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> I'm not sure what the problem is, so a simpler question:
>>
>> Nathaniel: What's the patsy way of merging levels?
>>
>> For example I have ethnicity: white, black, asian, latino and others
>> (substitute official names of ethnicity)
>> I want to keep white and black and merge all others, since there are too few
>> observations in the other levels.
>
> There aren't any tools for this out-of-the-box in patsy... patsy
> focuses on the categorical -> dmatrix stuff, and this is really just a
> question about manipulating categorical variables ahead of time. If
> there's some nice would-be-commonly-used utility function that could
> be added to patsy then we could do that, but in general the answer is
> to use pandas or something that's designed for general data
> manipulation. Or even something really low-tech like a simple list
> comprehension:
>
> df["CollapsedEthnicity"] = [ethnicity if ethnicity in ["White",
> "Black"] else "Other" for ethnicity in df["Ethnicity"]]

josef...@gmail.com

unread,
Mar 13, 2015, 4:47:32 PM3/13/15
to pystatsmodels
This is mainly categorical -> dmatrix stuff.
I thought there might be a custom encoding to accomplish this (a sum of some dummy columns).

Related: imposing a linear constraint on the parameters, params['Latino'] == params['Asian'], can be done through reparameterization of the design matrix.
This reparameterization for linear constraints is available in the `fit_constrained` method, currently only in models that have an offset option, since the reparameterization requires an offset in the general case.

Related: instead of imposing the constraints as equalities, we can impose them as stochastic constraints, similar to a Bayesian prior: params['Latino'] - params['Asian'] is normal(0, sigma_p).
The PR for this is not yet merged.
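A hedged sketch of the sum-of-dummies idea via a custom patsy contrast matrix; the level order, the column names, and which levels get merged are all chosen here for illustration:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix
from patsy.contrasts import ContrastMatrix

levels = ['white', 'black', 'asian', 'latino', 'other']

# Rows follow `levels`; asian/latino/other share one merged dummy column,
# which is the same as summing their three individual dummy columns.
merged = ContrastMatrix(
    np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 1.],
              [0., 0., 1.]]),
    ['[white]', '[black]', '[merged]'])

df = pd.DataFrame({'eth': ['white', 'asian', 'black', 'latino', 'other']})
dm = np.asarray(dmatrix('C(eth, merged, levels=levels) - 1', df))
```

Each row of `dm` then has a 1 in the column for its (possibly merged) level, e.g. the 'asian' row falls into the merged column.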

Josef

Nathaniel Smith

unread,
Mar 13, 2015, 4:58:07 PM3/13/15
to pystatsmodels
On Fri, Mar 13, 2015 at 1:47 PM, <josef...@gmail.com> wrote:
>
> On Fri, Mar 13, 2015 at 4:22 PM, Nathaniel Smith <n...@vorpus.org> wrote:
>>
>> On Fri, Mar 13, 2015 at 1:22 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> > On Fri, Mar 13, 2015 at 1:15 PM, <josef...@gmail.com> wrote:
>> >> Nathaniel: What's the patsy way of merging levels?
>> >>
>> >> For example I have ethnicity: white, black, asian, latino and others
>> >> (substitute official names of ethnicity)
>> >> I want to keep white and black and merge all others, since there are
>> >> too few
>> >> observations in the other levels.
>> >
>> > There aren't any tools for this out-of-the-box in patsy... patsy
>> > focuses on the categorical -> dmatrix stuff, and this is really just a
>> > question about manipulating categorical variables ahead of time. If
>> > there's some nice would-be-commonly-used utility function that could
>> > be added to patsy then we could do that, but in general the answer is
>> > to use pandas or something that's designed for general data
>> > manipulation. Or even something really low-tech like a simple list
>> > comprehension:
>> >
>> > df["CollapsedEthnicity"] = [ethnicity if ethnicity in ["White",
>> > "Black"] else "Other" for ethnicity in df["Ethnicity"]]
>
>
> This is mainly a categorical -> dmatrix stuff
> I thought there might be a custom encoding to accomplish this (sum of some
> dummy columns)

Yeah, I guess you could do it that way too. Pick your favorite
encoding matrix for Black/White/Other, and then use it to make a
custom encoding matrix by duplicating the Other row for
Asian/Latinx/etc. Again patsy won't particularly help you do this (if
you want a language for doing arbitrary ad hoc data manipulation then
that's what Python is for :-)), but it won't put any barriers in your
way either.

Alex

unread,
Mar 16, 2015, 6:00:43 AM3/16/15
to pystat...@googlegroups.com
Thanks for this reaction, it was very useful. For now, the bigger-hammer approach seems to do exactly what I wanted, but it's some of the most disgusting programming I've ever done, with very slow loops everywhere. I also just wanted you to know that in my actual code there are in fact multiple predictors; it's just one predictor that lacks data points in some of the categories. This is why I wanted to combine this predictor with the adjacent categories, which are somewhat similar, in order to get enough data points.
 

On Friday, March 13, 2015 at 9:00:50 PM UTC+1, Nathaniel Smith wrote: