exog / endog

VincentAB

Jul 22, 2012, 8:54:06 PM
to pystat...@googlegroups.com
This thread is for discussion of statsmodels' exog/endog naming convention.

This post summarizes this exchange: https://github.com/statsmodels/statsmodels/issues/395

Status quo
========

* Most models take 2 design matrices as arguments. Currently, they are named "exog" and "endog"

Problems
=======

* Feedback from multiple new users: the "endog" and "exog" names are a stumbling block for `statsmodels` adoption. These labels are often encountered in econometrics, but not in many other areas of applied statistics (even by users who have a strong stats background).
* In many contexts, the endog/exog distinction is ambiguous or a misnomer (e.g. models with endogenous regressors).

Option 1: Improving on the status quo
============================

* explanation for endog and exog should be more prominent (e.g. on the getting started page)

Benefits:

* Much less work!
* exog/endog is informative after you've learned what it means
* Moving away would be a big break with backwards compatibility and the current style
* "I thought of endog/exog as a nice distinguishing "feature" of statsmodels which also signals consistency across models."

Option 2: y, X , Z
=============

Benefits:

* Ubiquitous/Standard: used in lots of textbooks and by nearly all competitor software (e.g. R, Matlab, Stata)
* Immediately understandable by most; no longer a stumbling block for new users
* Consistent with related Python work (e.g. Pandas, PySal)

Problems:

* Non-pythonic
* Any introspection in a class that has one letter (or one greek letter) attributes is mostly useless.
* Non-expressive? "It always takes me an extra half a minute to figure out which is which, A, x, b for example." "Try to read the VAR code without the text book."

Response:

* "I don't think this is as drastically non-expressive as it seems on first glance. It's nearly universal. I'm not wild about one variable names either, it's non-pythonic, but I'm starting to come around to the viewpoint that endog and exog are an unnecessary technical/jargon hurdle."
* One-letter names in code are a completely different issue. "No one is arguing for using more one-letter variables all over the place, just x, y [and maybe z]."

Option 3: indep_var / dep_var
======================

Benefits:

* Not single letter
* More informative than X,y?

Problems:

* Sometimes ambiguous or misnomer.

Other stuff
========

* Is Formula integration a good or bad time to do this?
* IIRC, there was a brief message exchange that the sklearn developers are not so happy with the one letter names anymore.
* "How about moving to X & Y but describing them as exogenous & endogenous in the docs? That way the library keeps its econ heritage but moves to a generic naming convention."

Matthew Brett

Jul 22, 2012, 10:35:30 PM
to pystat...@googlegroups.com
Hi,
Strongly in favor of Option 2 - unambiguous and clear to a wide range
of backgrounds (I believe). Compatibility with other software seems
an obvious win.

I can never remember which one is which of endog / exog. Dependent /
independent is closer to my own tradition but not less inherently
confusing, in my opinion.

Any reason to prefer 'y' over 'Y'? Or the other way round? I guess
Y can sometimes be a matrix?

Thanks for bringing up the discussion.

Best,

Matthew

josef...@gmail.com

Jul 22, 2012, 10:45:06 PM
to pystat...@googlegroups.com
simple: exog has an x in it

Josef

josef...@gmail.com

Jul 22, 2012, 11:19:31 PM
to pystat...@googlegroups.com
Just as a reminder: We are trying to write a python library with full
classes and a coverage much more than linear models, not just a single
function that does OLS (or collection of functions)

compare introspection in Python with introspection in Matlab, Stata,
and R (and names with Java or Csharp conventions(?))

as in any econometrics paper, we will need

x
x_star
x_boldface
x_cursive
...
z_that_is_a_transformed_x
x_shortened
...

Josef

josef...@gmail.com

Jul 22, 2012, 11:36:19 PM
to pystat...@googlegroups.com
On Sun, Jul 22, 2012 at 10:35 PM, Matthew Brett <matthe...@gmail.com> wrote:
endog is a 1d or 2d array depending on the model and usage (until now
it's most of the time but not always 1d)
exog can be None, 1d, or 2d, in most cases it's 2d
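
To make the shapes described above concrete, here is a small NumPy sketch. The array names and sizes are made up for illustration; only the 1d/2d/None distinction comes from the description above.

```python
import numpy as np

n_obs = 5

# endog: most often a 1d array of responses, one value per observation
endog_1d = np.arange(n_obs, dtype=float)

# endog can also be 2d in some models (e.g. several response columns)
endog_2d = np.column_stack([endog_1d, endog_1d ** 2])

# exog: usually a 2d array, one row per observation, one column per regressor
exog_2d = np.column_stack([np.ones(n_obs), np.linspace(0.0, 1.0, n_obs)])

# exog can be None for models without explanatory variables
exog_none = None

print(endog_1d.shape, endog_2d.shape, exog_2d.shape)
```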

Josef

Alexandre Gramfort

Jul 23, 2012, 2:52:23 AM
to pystat...@googlegroups.com
hi,

of course I am a bit biased by the scikit-learn naming convention but if I may
share my experience, I have to look at the OLS example to map exog -> X and
endog -> y every time I use statsmodels (but I may not use it enough :) ).
exog and endog do not ring a bell for me. X and y in scikit-learn are not ideal
but they're easy to remember and those are the only variables that we tolerate
as single letters. I think it's a good thing to avoid domain specific jargon
as much as possible in a project that has such a widespread use.

my 2c

Alex

Alan G Isaac

Jul 23, 2012, 8:28:40 AM
to pystat...@googlegroups.com
Just a reminder that we already hashed this all out long ago.

Alan Isaac

josef...@gmail.com

Jul 23, 2012, 8:37:34 AM
to pystat...@googlegroups.com
On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
> Just a reminder that we already hashed this all out long ago.

This is mostly a reminder for 3 econometricians. I'm still convinced
that we took the right decision.
But I think we didn't explain this well enough in the documentation.

Josef

>
> Alan Isaac

Wes McKinney

Jul 23, 2012, 8:37:33 AM
to pystat...@googlegroups.com
On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
> Just a reminder that we already hashed this all out long ago.
>
> Alan Isaac

Yes, but the project needs to be prepared to make some changes in the
best interests of the project's future given the growth of the user
base. The difference between 100 (then) and 10k (now) or 100k users is
significant. Don't you want more users? Out of more users comes more
developers, too. It seems to me like a small price to pay.

For my part, I've found that exog/endog is a problem when teaching
statsmodels to a non-econometric crowd. Even after using the library
for a long time, the names still cause me cognitive dissonance.

- Wes

Skipper Seabold

Jul 23, 2012, 10:00:59 AM
to pystat...@googlegroups.com
Yes, to echo and add, I brought this up on github after the tutorial
at scipy last week (and I heard the same comments last year and the
year before). I had to stop and spend a few minutes explaining endog
and exog, was met with incredulous looks, and a few people left
shortly thereafter. Many in the audience were sophisticated users with
(applied) statistics or machine learning but not necessarily
econometrics backgrounds and their aversion to this was palpable.

We can try to rationalize away other users' difficulties, but the fact
remains that this is a stumbling block for new users, especially in a
"live" setting. Word of mouth travels, and I'd hope that "easy to use"
would be the first word, not "it's ok but all the models take this
endog and exog stuff, which I have to figure out every time."

Skipper

Skipper Seabold

Jul 23, 2012, 10:31:19 AM
to pystat...@googlegroups.com
On Sun, Jul 22, 2012 at 8:54 PM, VincentAB <vincen...@gmail.com> wrote:
> This thread is for discussion of statsmodels' exog/endog naming convention.
>

Thanks for summarizing! This is very helpful.
This also assumes that users bring up our docs every time they use the
library, though I will say that if we switched to having more
documentation templating, then every model could have a clear
description of endog, exog that we write once and use everywhere.

>
> Option 2: y, X , Z
> =============
>
> Benefits:
>
> * Ubiquitous/Standard: used in lots of textbooks and by nearly all
> competitor software (e.g. R, Matlab, Stata)
> * Immediately understandable by most; no longer a stumbling block for new
> users
> * Consistent with related Python work (e.g. Pandas, PySal)
>
> Problems:
>
> * Non-pythonic
> * Any introspection in a class that has one letter (or one greek letter)
> attributes is mostly useless.
> * Non-expressive? "It always takes me an extra half a minute to figure out
> which is which, A, x, b for example." "Try to read the VAR code without the
> text book."
>
> Response:
>
> * "I don't think this is as drastically non-expressive as it seems on first
> glance. It's nearly universal. I'm not wild about one variable names either,
> it's non-pythonic, but I'm starting to come around to the viewpoint that
> endog and exog are an unnecessary technical/jargon hurdle."
> * One-letter names in code are a completely different issue. "No one is
> arguing for using more one-letter variables all over the place, just x, y
> [and maybe z]."
>

I'm mildly in favor of biting the bullet and making this change. As I
mention down-thread, this is mostly due to endog/exog not being
_immediately_ clear without explanations, without bigger, more, and
better docs or assuming that everyone has an econometrics background.
I think Josef has shared the sentiment that users of a library
shouldn't even have to look at the docs. It should be obvious from the
call signature what the main arguments are for. We've been told that
we don't have this right now.

Everyone we've heard from that's not Josef, Alan, or me is in favor of
changing this (though I'll wait for this thread to take its course).
For users, old and new, (and new developers) this seems to be a
strongly desired switch. I don't think this feedback should be ignored
without considering what we gain by keeping endog and exog vs.
switching.

The main concern from developers (other than it's going to be a PITA
to switch, which is true but shouldn't be a deal breaker I don't
think.) is that we're going to all of a sudden have unclear and
obfuscated code. I'm arguing that this, while generally a valid
concern, is not going to happen just from switching endog -> y (or Y
for 2d?) and exog -> X. We have well commented, modular (mostly, when
I'm not knee deep in spaghetti) code. Changing the default variable
names to another _consistent_, _documented_ but more generally
accepted pattern isn't going to undo that.

> Option 3: indep_var / dep_var
> ======================
>
> Benefits:
>
> * Not single letter
> * More informative than X,y?
>
> Problems:
>
> * Sometimes ambiguous or misnomer.
>

I don't think this gains us much over endog/exog. It may solve the
jargon issue but still suffers from the ambiguity issue.

> Other stuff
> ========
>
> * Is Formula integration a good or bad time to do this?

I'm not sure about this yet, but it's my suspicion that this is the
right time to do this. We're still a young code-base, but I think this
is moving us in the right direction of becoming more than a developer
library.

> * IIRC, there was a brief message exchange that the sklearn developers are
> not so happy with the one letter names anymore.

This is a bit out of context and I believe was discussed for using one
letter names for, perhaps non-standard, tuning parameters. IIRC it was
spurred by the one letter tuning parameter in SVM being different than
the parameter of the same name in libsvm.

> * "How about moving to X & Y but describing them as exogenous & endogenous
> in the docs? That way the library keeps its econ heritage but moves to a
> generic naming convention."
>

I think this is more in the spirit of what we should be striving for.

My $.02,

Skipper

Nathaniel Smith

Jul 23, 2012, 10:33:10 AM
to pystat...@googlegroups.com
On Mon, Jul 23, 2012 at 3:00 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> Word of mouth travels, and I'd hope that "easy to use"
> would be the first word, not "it's ok but all the models take this
> endog and exog stuff, which I have to figure out every time."

This is probably a good place to admit that technically this does not
describe my experience, because while I can never remember the
difference between endog and exog, I also have never succeeded in
figuring it out by googling. Mostly when writing emails to this list
I've just given up and used other terminology, or else just named one
thing "endog" and one thing "exog" at random and figured that if I got
them backwards then you'd still understand.

Wikipedia even has a nice article on what "endogenous" means in
economics, which seems like it would be helpful, but their usage is
completely inconsistent with statsmodels' usage...:
https://en.wikipedia.org/wiki/Endogeneity_%28economics%29

FWIW, I think it's misguided to object to "X" and "y" *simply* on the
grounds that single-letter variable names are bad. The important thing
is for variable names to be clear, and to not give in to the
temptation to use abbreviations that only make sense when writing the
code and not when reading it later. So single letter variable names
are only justified when they really are the best name, and are
sufficiently established that later readers will be able to recognize
the meaning from that single letter. The classic example is using "i"
as an index variable in a loop -- that's totally okay. I think one can
reasonably argue that X and y meet those criteria.

(Technically 'beta' and 'sigma' are also single-letter variable names.)

-n

Alan G Isaac

Jul 23, 2012, 10:35:16 AM
to pystat...@googlegroups.com
On 7/23/2012 10:00 AM, Skipper Seabold wrote:
> Yes, to echo and add, I brought this up on github after the tutorial
> at scipy last week (and I heard the same comments last year and the
> year before). I had to stop and spend a few minutes explaining endog
> and exog, was met with incredulous looks, and a few people left
> shortly thereafter. Many in the audience were sophisticated users with
> (applied) statistics or machine learning but not necessarily
> econometrics backgrounds and their aversion to this was palpable.


Perhaps it would be better in such settings to just
say `endog` is our name for Y, and `exog` is our name
for X, if that is the notation they are used to?

Whatever the final choice, I hope it will fit well
in systems estimation frameworks, and in particular
I hope that it easily allows a clear distinction
between lagged endogs and other predetermined variables.
(Obviously I am not suggesting that choice of name
alone will have implications for such issues.)

Alan Isaac

josef...@gmail.com

Jul 23, 2012, 10:51:25 AM
to pystat...@googlegroups.com
a quick check with Stata:

regress depvar [indepvars] [if] [in] [weight] [, options]

the gui spells out "dependent variable" and "independent variable"

SAS Syntax

Syntax: REG Procedure

The following statements are available in PROC REG:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect006.htm

PROC REG <options> ;
<label:>MODEL dependents=<regressors> </ options> ;
BY variables ;
...

examples use y and x as place holders

Matlab: What's W,X,Y0,W0 ?
http://www.mathworks.com/help/toolbox/econ/vgxproc.html
Generate VARMAX model responses from innovations
Synopsis
[Y,logL] = vgxproc(Spec,W)
[Y,logL] = vgxproc(Spec,W,X,Y0,W0)
explanation: X Exogenous data


SPSS: http://academic.udayton.edu/gregelvers/psy216/SPSS/reg.htm
pictures of GUI
Dependent, Independent(s)


only matlab and R seem to favor x,y or X,y, or X,Y, and in R you
usually don't see them if you have to use formulas.

I like Stata.

Josef

>
> Skipper

Skipper Seabold

Jul 23, 2012, 11:05:43 AM
to pystat...@googlegroups.com
Just to point out, VAR is an estimator (almost entirely) particular to
econometrics.

>
> SPSS: http://academic.udayton.edu/gregelvers/psy216/SPSS/reg.htm
> pictures of GUI
> Dependent, Independent(s)
>
>
> only matlab and R seem to favor x,y or X,y, or X,Y, and in R you
> usually don't see them if you have to use formulas.
>
> I like Stata.

For the record, I'm not against depvar vs. indepvar, but I will point
out two things. 1) Stata (as an example since I'm more familiar with
programming estimators in it) doesn't use OO, so as soon as you get
down the inheritance chain (for us), you have things like

sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN)
[if] [in] [weight]

reg3 (depvar1 varlist1) (depvar2 varlist2) ...(depvarN varlistN) [if]
[in] [weight]

Y and X won't have the problem of having to switch nomenclature when
it's no longer theoretically correct I guess.

2) MATLAB and R are peculiar in that they're also programming
languages. Stata, SAS, and SPSS are primarily GUIs (Mata and IML
aside), so that you're never actually typing depvar = y. These are
mainly just used for documentation, which we can certainly include.
SAS I guess is the odd man out in that you are actually typing
dependents = ... I guess, but who wants to be more like SAS?

Skipper

josef...@gmail.com

Jul 23, 2012, 12:11:02 PM
to pystat...@googlegroups.com
we still need to be explicit for some cases in systemfit (or maybe
not, if the instruments are chosen correctly)

Model 2
exog(varlist) exogenous variables not specified in
system equations
endog(varlist) additional right-hand-side endogenous variables
inst(varlist) full list of exogenous variables

http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_syslin_sect044.htm

I don't understand much in the R docs ?
http://rss.acs.unt.edu/Rdoc/library/systemfit/html/systemfit.html
the version on my computer is even shorter, and seems to be more recent

But like VAR that's again "econometrics"

Next time I'll try to find something from statistics, but it's not so easy to
find something without E(X \epsilon) != 0
(maybe GEE)

Josef
>
> Y and X won't have the problem of having to switch nomenclature when
> it's no longer theoretically correct I guess.
>
> 2) MATLAB and R are peculiar in that they're also programming
> languages. Stata, SAS, and SPSS are primarily GUIs (Mata and IML
> aside), so that you're never actually typing depvar = y. These are
> mainly just used for documentation, which we can certainly include.
> SAS I guess is the odd man out in that you are actually typing
> dependents = ... I guess, but who wants to be more like SAS?

SAS has very good documentation, and I wouldn't mind selling a license
for $100000

>
> Skipper

achompas

Jul 23, 2012, 3:07:17 PM
to pystat...@googlegroups.com

> * "How about moving to X & Y but describing them as exogenous & endogenous
> in the docs? That way the library keeps its econ heritage but moves to a
> generic naming convention."
>

> I think this is more in the spirit of what we should be striving for.
>
> My $.02,
>
> Skipper

That's my suggestion from the Github Issues branch, and I think it's one of two "right ways" to resolve this problem (the other being "stick with exog & endog but identify them as statsmodels's analogues for X & y"). Here's why I see these as the "right solutions" for this problem.

First, let's identify the issue. What's the problem here? New users can't map function calls to their knowledge of statistical models. Let's say I want to use (or contribute to) statsmodels for the first time. Then, in IPython:

import statsmodels.api as sm
sm.GLS?

Then...wait, what do I do? Here's the relevant part of the docstring:

    Generalized least squares model with a general covariance structure.

    Parameters
    ----------
    endog : array-like
           endog is a 1-d vector that contains the response/independent variable
    exog : array-like
           exog is a n x p vector where n is the number of observations and p is
           the number of regressors/dependent variables including the intercept
           if one is included in the data.

First, this docstring is wrong since endog is the dependent variable while exog is the independent variable. That said, the corrected version might make sense to some veteran econometricians. For others, they won't know how to map this to their textbook understanding of GLMs. Then they'll hit Ctrl-D, enter 'R' at the bash prompt, and get back to work.

If we're thinking of statsmodels as an engineering project, what should its developers do? Take action that (a) doesn't require a lot of development time, yet (b) manages to thoroughly resolve the problem for the foreseeable future. I really really want X and y, but that will take a full refactor. What's the next best step? Keep endog and exog but update the docs so new devs/users can map them to variables they know.

pandas and statsmodels are two cool libraries. As more people start working with pandas and pandas.TimeSeries (like me with my current project), they'll need to analyze them with statistical or econometric tools. statsmodels is ripe for new users and more contributors, so let's pick a solution that maximizes adoption without requiring too much dev time. 

Skipper Seabold

Jul 23, 2012, 3:13:09 PM
to pystat...@googlegroups.com
On Mon, Jul 23, 2012 at 3:07 PM, achompas <bre...@gmail.com> wrote:
> First, this docstring is wrong since endog is the dependent variable while
> exog is the independent variable.

Well that's no good. Stop-gap:

https://github.com/statsmodels/statsmodels/commit/a515676737168568a9a834f7693199d55f8c7a4f

Skipper

Christoph Deil

Jul 23, 2012, 10:48:13 AM
to pystat...@googlegroups.com
I'm an astronomer and I've used statsmodels a few times, and this is exactly how I feel about statsmodels.

+1 on changing exog / endog to something else, because statsmodels is the only place where I've ever heard these terms and every time I use statsmodels (which is only a few times per year) I have to re-learn them.
Wikipedia doesn't know anything about  "exog" and "endog", using google I couldn't find anything useful quickly. Even in the statsmodels docs I couldn't find a good explanation.

-1 on any of the proposals to use exog / endog internally and x / y in the user interface or the other way around, code / papers where exactly the same thing is called by different names in different places are just unnecessarily confusing.

Here's my proposal for a rename:
xdata for what is called x at the moment.
ydata for what is called y or endog at the moment.
dmatrix or design_matrix for what is called X or exog at the moment.

This way there are no one-letter variable names and there is zero chance to confuse x and X.
I did not come up with these names myself, there is precedent. :-)

Independently of the choice made, I think it is great that Skipper, Vincent and others are trying to make the statsmodels docs more accessible.
At the moment it's really hard, because the statsmodels docs more or less start at http://statsmodels.sourceforge.net/devel/regression.html like this:
"Regression contains linear models with independently and identically distributed errors and for errors with heteroscedasticity or autocorrelation.
The statistical model is assumed to be Y = X * beta + mu, where mu ~ N(0, sigma ** 2 * Sigma), ..."
without explaining that the basic purpose of the code is to "compute the best-fit parameters beta given inputs X, Y and mu", that X is the design matrix, and how to construct the design matrix for common cases (polynomials, hyperplanes).
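
As a rough illustration of the design-matrix idea for the polynomial case, here is a NumPy-only sketch. The data are made up, and it uses plain least squares (`np.linalg.lstsq`) rather than statsmodels or the general error covariance mentioned in the quoted docs.

```python
import numpy as np

# Data (x, y) in the physicist's sense: y = 1 + 2*x + 3*x**2, no noise
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Design matrix for a quadratic polynomial: columns are 1, x, x**2
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Best-fit parameters beta minimize ||y - X @ beta||**2
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

The model is nonlinear in x but still linear in beta, which is why a single matrix X is enough to describe it.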

Just to be clear, these are the terms physicists understand: data (x, y), linear or nonlinear model, parameter, parameter error, fit.
Physicists have never heard of: exog, endog, not even design matrix.
(I just asked ~ 10 physicists / astronomers on my corridor, not a single one knew any of those three terms, even after I mentioned that they are used in conjunction with fitting linear models.)
I can see now that the term design matrix is very central to fitting linear models, so it should be in statsmodels, but I sure would like to see exog and endog go and design matrix better explained in the docs.

Here's a little code example using ROOT, the data analysis package most widely used by physicists, that shows an API that they would understand:
ROOT has python bindings ( http://root.cern.ch/drupal/content/how-use-use-python-pyroot-interpreter ) and a formula framework ( http://root.cern.ch/root/html/TF1.html ) that allows easily defining and fitting linear and non-linear models. I don't mean to say that ROOT and the python bindings are all gold: there are both small annoyances and serious fundamental problems with using ROOT from python. More and more physicists are using numpy / scipy, and hopefully soon also statsmodels if you guys manage to make it understandable for us.

Thanks for working on statsmodels and considering huge API changes with renaming the basic input parameters and integrating formulas.
I believe it would make statsmodels much more accessible to physicists (and I believe most other scientists and engineers, basically most data analysts outside econometrics / statistics?) and would be worth the trouble now in the long run.

Christoph

josef...@gmail.com

Jul 24, 2012, 8:47:35 PM
to pystat...@googlegroups.com
Thanks Christoph, pretty convincing arguments below, the science and
engineering tradition of numpy and scipy has partially slipped my
mind.

>
> +1 on changing exog / endog to something else, because statsmodels is the
> only place where I've ever heard these terms and every time I use
> statsmodels (which is only a few times per year) I have to re-learn them.
> Wikipedia doesn't know anything about "exog" and "endog", using google I
> couldn't find anything useful quickly. Even in the statsmodels docs I
> couldn't find a good explanation.

definitely a documentation failure

>
> -1 on any of the proposals to use exog / endog internally and x / y in the
> user interface or the other way around, code / papers where exactly the same
> thing is called by different names in different places are just
> unnecessarily confusing.

That's exactly what I was thinking of as a compromise, use x,y in the
signature and some more informative names in the models and result
instances. (more below)
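
A minimal sketch of what that compromise could look like. `ToyLinearModel` is a hypothetical class invented for this illustration and is not part of statsmodels; the fit uses NumPy least squares only to keep the example self-contained.

```python
import numpy as np


class ToyLinearModel:
    """Hypothetical model class, not part of statsmodels."""

    def __init__(self, y, x):
        # Short, textbook-style names in the call signature,
        # more descriptive names stored on the instance.
        self.endog = np.asarray(y, dtype=float)
        self.exog = np.asarray(x, dtype=float)

    def fit(self):
        # Ordinary least squares via NumPy
        params, *_ = np.linalg.lstsq(self.exog, self.endog, rcond=None)
        return params


x = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = ToyLinearModel(y, x)
params = model.fit()
print(params)
```

The cost is the one raised in the quoted objection: the same array is called `y` at the call site and `endog` on the instance.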

>
> Here's my proposal for a rename:
> xdata for what is called x at the moment.
> ydata for what is called y or endog at the moment.

xdata and ydata don't really look more informative than x, y

> dmatrix or design_matrix for what is called X or exog at the moment.

I never heard of a design_matrix before stats.models. That's an
unknown or (essentially) unused name in econometrics, since in
econometrics we seldom *design* our data.

R lm uses model and x for the dataframe and matrix of regressors. (I
don't know the details.) "model" is at least a weird name for a matrix
of explanatory variables.
I'm not sure in these cases whether it's the implementation that is
relevant or just the examples that use x and y.

>
> Thanks for working on statsmodels and considering huge API changes with
> renaming the basic input parameters and integrating formulas.
> I believe it would make statsmodels much more accessible to physicists (and
> I believe most other scientists and engineers, basically most data analysts
> outside econometrics / statistics?) and would be worth the trouble now in
> the long run.
>
> Christoph
>

I can see that exog/endog doesn't have much meaning outside of
econometrics, social and a few other sciences.

My problem is that I really don't like letter names. x, y, i are in my
opinion temporary variables. I try not to use i as a loop index unless
the loop is just a few lines or a list comprehension. In longer loops
I always worry I might have used `i` already and better not use it at
all. x and y are also generic names, I might have used them as temp
variable, which would accidentally overwrite the real ones. (xdata and
ydata sounds better again.)

As alternative to exogenous and endogenous variables, I think, the
only ones that are not a misnomer in some cases are dependent variable
and explanatory variables, independent variable is a nicer name but
means roughly the same as exog.
I never found a good short name for explanatory variable.


How deep do we want to change?

Given that Alan and I are a small minority, let's assume we switch to x and y.

Changing the signature of the models is easy OLS(y, x) RLM(y,x)

The question is what to do internally.

OLS, WLS, GLS, RLM, GLM and discrete: large parts of our current core
models are easy y=endog, x=exog, Then, there are wexog, wendog.

Do any users care what ols_results.model.wendog is called?

In tsa it gets a bit more complicated,
VARX has the regression matrix [past y, constant, trend, and real x]
(where x=exogenous variables and not yet implemented) (and the
regression matrix is shortened relative to the full data)
ARX, ARMAX similar to VARX, except it uses Kalman Filter and state
space representation.

some discrete and GLM models allow for 2d y/exog that stacks some
additional information.

...

multi-equation models
GLSHet(endog, exog, exog_var=None, weights=None, link=None)
system of equations ...
.....


datasets

from statsmodels.datasets.longley import load
data = load()
data.endog, data.exog - I know from the name that these are for
the estimation

data.x - I have no idea whether this is the transformed, selected
data for the estimation example or the full dataset, or just some
intermediate data


documentation

examples: What do people use, when there is a specific dataset

x_longley ?
x_grunfeld ?

my style:
>>> sorted([name for name in locals().keys() if name[0] in ['e', 'y', 'x']])
['endog_aircraft', 'endog_sal', 'endog_sal0', 'endog_wood',
'exog_aircraft', 'exog_sal', 'exog_sal0', 'exog_wood']

Josef

VincentAB

Jul 24, 2012, 9:20:05 PM
to pystat...@googlegroups.com
re: examples

Now that `patsy` is there and that creating design matrices from raw data takes just a single line of code, I don't think that examples should ever use the exog and endog matrices stored in statsmodels data objects.

1) Doing that is no more compact or clear than calling patsy to create design matrices from scratch using the raw data
2) Using the data.exog/endog attributes forces users to learn something about the structure of the datasets objects. Most users shouldn't have to care about that since they are unlikely to use these data in actual work.
3) Going through patsy in examples is good practice and a nice teaching opportunity in terms of giving users the tools/knowledge needed to integrate statsmodels in their analysis workflow.
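
For reference, a sketch of the single-line patsy call referred to above. The column names and values are made up; `dmatrices` is patsy's function for building both matrices from a formula.

```python
import numpy as np
from patsy import dmatrices

# Made-up raw data; a dict of named columns is enough for this sketch
data = {
    "income": np.array([10.0, 12.0, 15.0, 19.0, 24.0]),
    "age": np.array([20.0, 25.0, 30.0, 35.0, 40.0]),
}

# One line: the left-hand side becomes y, the right-hand side becomes X
y, X = dmatrices("income ~ age", data)

print(X.design_info.column_names)
print(np.asarray(y).shape, np.asarray(X).shape)
```

The formula adds the intercept column to X automatically, so the user never has to name either matrix endog or exog.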

In my view, the exog and endog attributes of datasets should mostly be there for convenience in internal testing, and users should have minimal exposure to them. If this becomes the case^*, then the naming convention won't matter much.

FWIW, I think data.X is fine for data.exog (but capitalization is probably important to denote matrix form).

Vincent

* I can probably help with tweaking examples if people agree with the above.

Skipper Seabold

Jul 24, 2012, 9:24:16 PM
to pystat...@googlegroups.com
On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
> In my view, the exog and endog attributes of datasets should mostly be there
> for convenience in internal testing, and users should have minimal exposure
> to them. If this becomes the case^*, then the naming convention won't matter
> much.
>

Briefly. This was their original intention. I've mainly switched to
using load_pandas() for examples to be closer to what people are
really doing.

I don't think we would need to rename these.

Skipper

josef...@gmail.com

Jul 25, 2012, 3:36:56 AM
to pystat...@googlegroups.com
On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
> re: examples
>
> Now that `patsy` is there and that creating design matrices from raw data
> takes just a single line of code, I don't think that examples should ever
> use the exog and endog matrices stored in statsmodels data objects.
>
> 1) Doing that is no more compact or clear than calling patsy to create
> design matrices from scratch using the raw data
> 2) Using the data.exog/endog attributes forces users to learn something
> about the structure of the datasets objects. Most users shouldn't have to
> care about that since they are unlikely to use these data in actual work.
> 3) Going through patsy in examples is good practice and a nice teaching
> opportunity in terms of giving users the tools/knowledge needed to integrate
> statsmodels in their analysis workflow.
>
> In my view, the exog and endog attributes of datasets should mostly be there
> for convenience in internal testing, and users should have minimal exposure
> to them. If this becomes the case^*, then the naming convention won't matter
> much.

Next up some simple GUI widgets to select your y and x variables, then
a full GUI, and users won't realise that there is python inside
instead of R or ....
And we better go commercial before somebody else does.

Josef

Alexandre Gramfort

Jul 25, 2012, 5:06:58 AM7/25/12
to pystat...@googlegroups.com
hi,

If I may share some of my experience with scikit-learn: our convention is to
use capital letters for arrays with 2 or more dimensions. So we use X, and we are
slowly moving from y to Y in estimators that can work with multiple outputs.

Another convention we use is to add a trailing underscore to quantities
estimated from the data. For example, if beta holds the regression
coefficients, we use beta_.
That's pretty convenient to inspect an estimator instance.
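
A rough sketch of both conventions, assuming a toy least-squares estimator. The `LeastSquares` class is hypothetical and written only for this illustration; it is not scikit-learn's own code.

```python
import numpy as np


class LeastSquares:
    """Toy estimator for illustration only (hypothetical class)."""

    def fit(self, X, y):
        # Capital X: 2-D array of shape (n_samples, n_features).
        X = np.asarray(X, dtype=float)
        # Lowercase y: 1-D array of shape (n_samples,).
        y = np.asarray(y, dtype=float)
        # Trailing underscore: a quantity estimated from the data.
        self.beta_ = np.linalg.lstsq(X, y, rcond=None)[0]
        return self


# The first column of X is a constant, so the data follow y = 1 + 2 * x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LeastSquares().fit(X, y)

# Inspecting the instance: everything ending in "_" was estimated.
estimated = [name for name in vars(model) if name.endswith("_")]
print(estimated, model.beta_)
```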

It would be great to have a consensus on this.

my 0.02 euros cents.

Alex

VincentAB

Jul 25, 2012, 8:42:27 AM7/25/12
to pystat...@googlegroups.com
Hehe!

The point is that the data set object structure doesn't do *anything* useful for the user in his actual work, so he shouldn't have to learn it. This is not an instance of syntactic sugar hiding other *useful* commands. I'm not advocating for all-automated-all-the-time. Many examples still create their own artificial data Xs and Ys anyway.

Vincent

Skipper Seabold

Jul 25, 2012, 1:06:01 PM7/25/12
to pystat...@googlegroups.com
Ah, yeah. The crowd at scipy was mainly machine learning and stats,
finance, astronomy, engineering, science (bio, ecology). None of these
people knew endogenous / exogenous.
Off topic, and I brought this up before so it may not be the case
anymore, but IIRC, to confuse the issue patsy is also using dmatrix
for the LHS...
I think calling it whitened_x or whitened_y is even more informative.
I had no idea what these were when I first came to this code. The only
reference I could find was in some of Jonathan's lecture notes IIRC.

>
> In tsa it gets a bit more complicated,
> VARX has the regression matrix [past y, constant, trend, and real x]
> (where x=exogenous variables and not yet implemented) (and the
> regression matrix is shortened relative to the full data)
> ARX, ARMAX similar to VARX, except it uses Kalman Filter and state
> space representation.
>

ARMAX is already using exog internally to denote the whole RHS I believe.

> some discrete and GLM models allow for 2d y/exog that stacks some
> additional information.
>
> ...
>
> multi-equation models
> GLSHet(endog, exog, exog_var=None, weights=None, link=None)
> system of equations ...
> .....
>
>
> datasets
>
> from statsmodels.datasets.longley import load
> data = load()
> data.endog, data.exog - I know from the name that these are for
> the estimation
>
> data.x - I have no idea whether this is the transformed, selected
> data for the estimation example or the full dataset, or just some
> intermediate data
>

I think these could maybe stay. These are mainly for internal/testing
use anyway, and I've mostly switched to *.load_pandas() when I'm doing
anything other than testing.
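
For reference, a small sketch of what `load_pandas()` gives you for the Longley data mentioned above. This assumes a statsmodels version where the dataset module exposes `load_pandas()`; the attribute names shown are the ones discussed in this thread.

```python
from statsmodels.datasets import longley

# load_pandas() returns the dataset with pandas objects attached.
data = longley.load_pandas()

# endog is the single outcome column; exog holds the regressors.
print(data.endog.name)
print(list(data.exog.columns))

# data.data is the full raw table, which is what most users
# would start from in their own work.
print(data.data.shape)
```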

Nathaniel Smith

Jul 26, 2012, 7:23:45 AM7/26/12
to pystat...@googlegroups.com
On Mon, Jul 23, 2012 at 3:51 PM, <josef...@gmail.com> wrote:
> a quick check with Stata:
>
> regress depvar [indepvars] [if] [in] [weight] [, options]
>
> the gui spells out "dependent variable" and "independent variable"

I'm fine with "dependent" and "independent", but I think that's just
because it's the jargon I grew up with -- the actual meaning is not at
all transparent.

When talking to non-specialists I think I usually refer to the "y"
variable as the "outcome" and the "x" variables as "predictors" or
"regressors".

-n