regression OLS: No reference for ordered categories

27 views
Skip to first unread message

c.b...@posteo.jp

unread,
Jun 28, 2023, 9:20:31 AM6/28/23
to Pystatsmodels
Hello,

I try to understand a specific aspect of the summary output of
statsmodels.formula.api.ols().

==================================================================================
coef std err t P>|t| [0.025
0.975]
----------------------------------------------------------------------------------
Intercept 53.0026 1.877 28.235 0.000 49.319
56.686
foo[T.b] -0.1790 2.239 -0.080 0.936 -4.572
4.214
foo[T.c] -2.8599 2.257 -1.267 0.205 -7.289
1.570
gender[T.male] -2.7580 1.812 -1.522 0.128 -6.313
0.797
scale(age) 0.3216 0.906 0.355 0.723 -1.456
2.099
==============================================================================

The variable "foo" is an ordered categorical.
The variable "gender" is also categorical but not ordered.

The reference for "gender" is obviously "female".
But without knowing the raw data it is not clear what the reference is
for "foo". It is "a".

Is there a way to make the reference visible in that summary output? For
example like this:

Intercept 53.0026 1.877 28.235 0.000 49.319
56.686
foo[T.a] Reference
foo[T.b] -0.1790 2.239 -0.080 0.936 -4.572
4.214
foo[T.c] -2.8599 2.257 -1.267 0.205 -7.289
1.570
gender[T.female] Reference
gender[T.male] -2.7580 1.812 -1.522 0.128 -6.313
0.797
scale(age) 0.3216 0.906 0.355 0.723 -1.456
2.099

Here is a full MWE reproducing the output. Thanks in advance.

#!/usr/bin/env python3
import random
import pandas
import statsmodels.formula.api as smapi


random.seed(0)
k = 1000
df = pandas.DataFrame({
'oid': range(1000),
'gender': random.choices(['female', 'male'], k=k),
'foo': random.choices(['a', 'b', 'c'], k=k),
'age': random.choices(range(18, 100), k=k),
'outcome_score': random.choices(range(100), k=k),
})
df.gender = df.gender.astype('category')
df.foo = df.foo.astype(pandas.CategoricalDtype(['a', 'b', 'c'],
ordered=True))

model = smapi.ols(
formula='outcome_score ~ scale(age) + foo + gender',
data=df)
result = model.fit()
summary = result.summary()

print(summary)

josef...@gmail.com

unread,
Jun 28, 2023, 11:10:58 AM6/28/23
to pystat...@googlegroups.com
Unfortunately, It's not supported and it would be very difficult to figure out what the missing reference levels are.

Formulas are handled by patsy and our models only get the final design_matrix with some design_info.

For simple (mainl only) effects, it would still be possible to figure out from the design_info what the reference level is.
For interactions, that would be very difficult (AFAICS) without changing or replicating some of the contrast coding in patsy.
Furthermore, it would also need to handle different encoding schemes.

Patsy does not have an option to NOT drop reference categories, so we cannot get the full, although singular, design matrix.
If we could get that, then we could use `fit_constrained` (in e.g. GLM) to impose that the parameters for the relevant levels are zero.
(fit_constrained results returns all parameters even if they are constrained to specific values or satisfy collinear constraints, at the cost of a singular cov_params.)

Maybe this will change when we add or switch to formulaic as replacement for patsy. But I expect that this will not be soon.

Josef

--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/694dd3659dbc9c9ed43ed434a9782d69%40posteo.de.
Reply all
Reply to author
Forward
0 new messages