Interaction terms in statsmodels regression?

christoph...@ucsf.edu

Sep 10, 2014, 8:58:04 PM

to pystat...@googlegroups.com

Hi

I'm trying to build a regression model to tell me if two input variables are interacting.

Right now I'm doing this:

InteractionModelpVals=sm.OLS(response,covariateMatrix).fit().pvalues

It kind of works, but I created the interaction term as a new variable having the possible values of -3, -2, 2, or 3. This is maybe not the best. Or is that a legit way to create an interaction term?

In R you can literally multiply two terms together to get an interaction term and then stick it in your formula.
interactionTerm <- term1*term2
response ~ term1 + term2 + interactionTerm
https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-11/recipe-11-6

Should I not be using OLS to fit such a model?
Bonus points for explaining to me a good way to incorporate categorical covariates into my models. (Is there a faster way than creating binary dummy variables for each category?)

Thanks!
Chris

josef...@gmail.com

Sep 10, 2014, 9:33:40 PM

to pystatsmodels

Hi,

Use our (i.e. patsy's) formulas, especially if you are familiar with R.

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html

See the patsy documentation for some differences between R and patsy; most formula definitions are very similar or the same.

Also use dmatrix or dmatrices to get a hold of the design matrix directly, to see if it's what you think it should be, and for reuse.
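As a rough sketch of what inspecting the design matrix could look like: the data and the column names (`term1`, `term2`, `batch`) below are made up for illustration and are not from this thread. `return_type="dataframe"` asks patsy for a pandas DataFrame so the column labels are easy to read.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data; column names are placeholders, not from the thread.
df = pd.DataFrame({
    "term1": [1.0, 2.0, 3.0, 4.0],
    "term2": [0.5, 1.5, 2.5, 3.5],
    "batch": ["a", "b", "a", "c"],
})

# "term1 * term2" expands to term1 + term2 + term1:term2;
# C(batch) is dummy-coded, with the first level absorbed by the intercept.
design = dmatrix("C(batch) + term1 * term2", data=df, return_type="dataframe")

print(design.columns.tolist())
print(design)
```

For numeric columns, the `term1:term2` column is simply the elementwise product of the two inputs, so printing the matrix is a quick way to confirm the formula built what you expected.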

anova_lm is a function that provides ANOVA tables after a regression:

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/interactions_anova.html
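One way anova_lm might be used to check for an interaction is to compare a fit without the interaction term against a fit with it. This is only a sketch on simulated data; the variable names (`x1`, `x2`, `y`) and the coefficients used to generate `y` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated data with a built-in x1*x2 interaction (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (
    1.0 + 0.5 * df["x1"] - 0.3 * df["x2"]
    + 0.8 * df["x1"] * df["x2"]
    + rng.normal(scale=0.5, size=n)
)

# Nested models: main effects only vs. main effects plus interaction.
additive = ols("y ~ x1 + x2", data=df).fit()
interacting = ols("y ~ x1 * x2", data=df).fit()

# F-test of whether adding the interaction term improves the fit.
table = anova_lm(additive, interacting)
print(table)
```

The second row of the table reports the F statistic and its p-value (`Pr(>F)`) for the extra term.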

a few more examples are spread throughout the notebooks here

http://statsmodels.sourceforge.net/devel/examples/index.html

I hope that helps.

Josef

christoph...@ucsf.edu

Sep 17, 2014, 10:15:21 PM

to pystat...@googlegroups.com

Thanks, Josef.
Just an update for any newbies out there seeing this.
I'm using pandas DataFrames, which make the code more readable.
For categorical variables, just write "C(variable)" inside your model formula and the variable is treated as categorical (I think it automatically generates dummy variables).
Here's some code:

from statsmodels.formula.api import ols
...
# C(dataBatch) treats dataBatch as categorical;
# diet*votingHistory expands to diet + votingHistory + diet:votingHistory
ModelFormula = "ResultVariable ~ C(dataBatch) + diet*votingHistory"  # best interaction variable ever.
...collecting data in MergedDataFrame...
# here's the guts of my fit, extracting the p-values
InteractionModelpVals = ols(formula=ModelFormula, data=MergedDataFrame).fit().pvalues
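For anyone who wants something they can run end to end, here is a sketch with simulated data standing in for MergedDataFrame. The data-generating numbers, the assumption that diet and votingHistory are numeric, and the extra column name `dietXvoting` are all made up for illustration. It also fits the model with a hand-multiplied product column, to show that for numeric inputs this gives the same interaction estimate as the formula's `diet:votingHistory` term.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated stand-in for MergedDataFrame (all values are hypothetical).
rng = np.random.default_rng(1)
n = 120
MergedDataFrame = pd.DataFrame({
    "dataBatch": rng.choice(["b1", "b2", "b3"], size=n),
    "diet": rng.normal(size=n),
    "votingHistory": rng.normal(size=n),
})
MergedDataFrame["ResultVariable"] = (
    2.0 + MergedDataFrame["diet"]
    + 0.5 * MergedDataFrame["diet"] * MergedDataFrame["votingHistory"]
    + rng.normal(size=n)
)

# Formula version: patsy builds the interaction column for us.
fit_formula = ols(
    "ResultVariable ~ C(dataBatch) + diet*votingHistory", data=MergedDataFrame
).fit()

# Manual version: multiply the two columns ourselves (dietXvoting is a made-up name).
MergedDataFrame["dietXvoting"] = (
    MergedDataFrame["diet"] * MergedDataFrame["votingHistory"]
)
fit_manual = ols(
    "ResultVariable ~ C(dataBatch) + diet + votingHistory + dietXvoting",
    data=MergedDataFrame,
).fit()

# The interaction p-value is indexed by the term's name in each fit.
print(fit_formula.pvalues["diet:votingHistory"])
print(fit_manual.pvalues["dietXvoting"])
```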

peace,
Chris
