|exog / endog||VincentAB||7/22/12 5:54 PM|
This thread is for discussion of statsmodels' exog/endog naming convention.
This post summarizes this exchange: https://github.com/statsmodels/statsmodels/issues/395
* Most models take two design matrices as arguments. Currently, they are named "exog" and "endog".
* Feedback from multiple new users: the "endog" and "exog" names are a stumbling block for `statsmodels` adoption. These labels are often encountered in econometrics, but not in many other areas of applied statistics (even by users who have a strong stats background).
* In many contexts, the endog/exog distinction is ambiguous or a misnomer (e.g. models with endogenous regressors).
Option 1: Improving on the status quo
* explanation for endog and exog should be more prominent (e.g. on the getting started page)
* Much less work!
* exog/endog is informative after you've learned what it means
* Moving away would be a big break with backwards compatibility and the current style
* "I thought of endog/exog as a nice distinguishing "feature" of statsmodels which also signals consistency across models."
Option 2: y, X , Z
* Ubiquitous/Standard: used in lots of textbooks and by nearly all competitor software (e.g. R, Matlab, Stata)
* Immediately understandable by most; no longer a stumbling block for new users
* Consistent with related Python work (e.g. Pandas, PySal)
* Any introspection in a class that has one letter (or one greek letter) attributes is mostly useless.
* Non-expressive? "It always takes me an extra half a minute to figure out which is which, A, x, b for example." "Try to read the VAR code without the text book."
* "I don't think this is as drastically non-expressive as it seems on first glance. It's nearly universal. I'm not wild about one variable names either, it's non-pythonic, but I'm starting to come around to the viewpoint that endog and exog are an unnecessary technical/jargon hurdle."
* One-letter names in code are a completely different issue. "No one is arguing for using more one-letter variables all over the place, just x, y [and maybe z]."
Option 3: indep_var / dep_var
* Not single letter
* More informative than X,y?
* Sometimes ambiguous or misnomer.
* Is Formula integration a good or bad time to do this?
* IIRC, there was a brief message exchange that the sklearn developers are not so happy with the one letter names anymore.
* "How about moving to X & Y but describing them as exogenous & endogenous in the docs? That way the library keeps its econ heritage but moves to a generic naming convention."
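One way to picture the compatibility concern raised under Option 1 is a thin alias layer that accepts both spellings during a transition. The sketch below is only an illustration: `resolve_design_args` is a hypothetical helper, not part of statsmodels, and the choice of a DeprecationWarning is an assumption.

```python
import warnings


def resolve_design_args(y=None, X=None, endog=None, exog=None):
    """Return (y, X), accepting the legacy endog/exog keywords as aliases.

    Hypothetical helper for illustration; it is not part of statsmodels.
    """
    if endog is not None:
        if y is not None:
            raise TypeError("pass either 'y' or 'endog', not both")
        warnings.warn("'endog' is an alias for 'y'", DeprecationWarning, stacklevel=2)
        y = endog
    if exog is not None:
        if X is not None:
            raise TypeError("pass either 'X' or 'exog', not both")
        warnings.warn("'exog' is an alias for 'X'", DeprecationWarning, stacklevel=2)
        X = exog
    if y is None:
        raise TypeError("a response variable is required")
    return y, X


# Usage: old-style and new-style calls resolve to the same pair.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_design_args(endog=[1.0, 2.0], exog=[[1.0, 0.0], [1.0, 1.0]])
modern = resolve_design_args(y=[1.0, 2.0], X=[[1.0, 0.0], [1.0, 1.0]])
```

A layer like this would keep existing user code working while the documented names change, at the cost of carrying two spellings for some time.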
|Re: [pystatsmodels] exog / endog||Matthew Brett||7/22/12 7:35 PM|
Strongly in favor of Option2 - unambiguous and clear to a wide range
of backgrounds (I believe). Compatibility with other software seems
an obvious win.
I can never remember which one is which of endog / exog. Dependent /
independent is closer to my own tradition but not less inherently
confusing, in my opinion.
Any reason to prefer 'y' over 'Y'? Or the other way round? I guess
Y can sometimes be a matrix?
Thanks for bringing up the discussion.
|Re: [pystatsmodels] exog / endog||josefpktd||7/22/12 7:45 PM|
simple: exog has an x in it
|Re: [pystatsmodels] exog / endog||josefpktd||7/22/12 8:19 PM|
Just as a reminder: We are trying to write a python library with full
classes and a coverage much more than linear models, not just a single
function that does OLS (or collection of functions)
compare introspection in Python with introspection in Matlab, Stata,
and R (and names with Java or Csharp conventions(?))
as any paper in econometric, we will need
|Re: [pystatsmodels] exog / endog||josefpktd||7/22/12 8:36 PM|
On Sun, Jul 22, 2012 at 10:35 PM, Matthew Brett <matthe...@gmail.com> wrote:
endog is a 1d or 2d array depending on the model and usage (until now
it's most of the time but not always 1d)
exog can be None, 1d, or 2d, in most cases it's 2d
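The shape rules just described could be checked along the following lines. This is a rough sketch using NumPy; `describe_inputs` is a hypothetical name, and treating a 1d exog as a single column is an assumption made here for illustration, not something stated in the thread.

```python
import numpy as np


def describe_inputs(endog, exog=None):
    """Report the dimensionality of the two inputs after array conversion.

    Hypothetical helper for illustration only.
    """
    endog = np.asarray(endog, dtype=float)
    if endog.ndim not in (1, 2):
        raise ValueError("endog must be 1d or 2d")
    if exog is None:
        return endog.ndim, None
    exog = np.asarray(exog, dtype=float)
    if exog.ndim == 1:
        # Assumption: a single regressor becomes one column, rows are observations.
        exog = exog[:, None]
    if exog.ndim != 2:
        raise ValueError("exog must be None, 1d, or 2d")
    if exog.shape[0] != endog.shape[0]:
        raise ValueError("endog and exog need the same number of observations")
    return endog.ndim, exog.shape
```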
|Re: [pystatsmodels] exog / endog||Alexandre Gramfort||7/22/12 11:52 PM|
of course I am a bit biased by the scikit-learn naming convention but if I may
share my experience, I have to look at the OLS example to map exog -> X and
endog -> y every time I use statsmodels (but I may not use it enough :) ).
exog and endog does not ring a bell to me. X and y in scikit-learn is not ideal
but it's easy to remember and that's the only variables that we tolerate
as single letters. I think it's a good thing to avoid domain specific jargon
as much as possible in a project that has such a widespread use.
|Re: [pystatsmodels] exog / endog||Alan G Isaac||7/23/12 5:28 AM|
Just a reminder that we already hashed this all out long ago.
|Re: [pystatsmodels] exog / endog||josefpktd||7/23/12 5:37 AM|
On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
This is mostly a reminder for 3 econometricians. I'm still convinced
that we took the right decision.
But I think we didn't explain this well enough in the documentation.
> Alan Isaac
|Re: [pystatsmodels] exog / endog||Wes McKinney||7/23/12 5:37 AM|
On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
> Just a reminder that we already hashed this all out long ago.
Yes, but the project needs to be prepared to make some changes in the
best interests of the project's future given the growth of the user
base. The difference between 100 (then) and 10k (now) or 100k users is
significant. Don't you want more users? Out of more users comes more
developers, too. It seems to me like a small price to pay.
For my part, I've found that exog/endog is a problem when teaching
statsmodels to a non-econometric crowd. Even after using the library
for a long time, the names still cause me cognitive dissonance.
|Re: [pystatsmodels] exog / endog||jseabold||7/23/12 7:00 AM|
Yes, to echo and add, I brought this up on github after the tutorial
at scipy last week (and I heard the same comments last year and the
year before). I had to stop and spend a few minutes explaining endog
and exog, was met with incredulous looks, and a few people left
shortly thereafter. Many in the audience were sophisticated users with
(applied) statistics or machine learning but not necessarily
econometrics backgrounds and their aversion to this was palpable.
We can try to rationalize away other users' difficulties, but the fact
remains that this is a stumbling block for new users, especially in a
"live" setting. Word of mouth travels, and I'd hope that "easy to use"
would be the first word, not "it's ok but all the models take this
endog and exog stuff, which I have to figure out every time."
|Re: [pystatsmodels] exog / endog||jseabold||7/23/12 7:31 AM|
On Sun, Jul 22, 2012 at 8:54 PM, VincentAB <vincen...@gmail.com> wrote:
> This thread is for discussion of statsmodels' exog/endog naming convention.
Thanks for summarizing! This is very helpful.
This also assumes that users bring up our docs every time they use the
library, though I will say that if we switched to having more
documentation templating that every model could have a clear
description of endog, exog that we write once and use everywhere.
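The "write once and use everywhere" idea might look roughly like this. It is a minimal sketch: the `fill_doc` decorator, the `ToyOLS`/`ToyRLM` classes, and the wording of the shared descriptions are all hypothetical and are not taken from the statsmodels code base.

```python
# Shared parameter descriptions, written once (wording is illustrative).
_COMMON_PARAMS = {
    "endog_doc": "endog : array-like\n        The dependent (response) variable, often written y.",
    "exog_doc": "exog : array-like\n        The explanatory variables (regressors), often written X.",
}

_MODEL_DOC = """%(summary)s

    Parameters
    ----------
    %(endog_doc)s
    %(exog_doc)s
"""


def fill_doc(summary):
    """Class decorator that builds __doc__ from the shared template."""
    def decorate(cls):
        cls.__doc__ = _MODEL_DOC % dict(_COMMON_PARAMS, summary=summary)
        return cls
    return decorate


@fill_doc("Hypothetical least squares model.")
class ToyOLS:
    pass


@fill_doc("Hypothetical robust regression model.")
class ToyRLM:
    pass
```

With this arrangement, a change to the description of endog or exog is made in one place and shows up in every model's docstring.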
I'm mildly in favor of biting the bullet and making this change. As I
mention down-thread, this is mostly due to endog/exog not being
_immediately_ clear without explanations, without bigger, more, and
better docs or assuming that everyone has an econometrics background.
I think Josef has shared the sentiment that users of a library
shouldn't even have to look at the docs. It should be obvious from the
call signature what the main arguments are for. We've been told that
we don't have this right now.
Everyone we've heard from that's not Josef, Alan, or I is in favor of
changing this (though I'll wait for this thread to take its course).
For users, old and new, (and new developers) this seems to be a
strongly desired switch. I don't think this feedback should be ignored
without considering what we gain by keeping endog and exog vs.
The main concern from developers (other than it's going to be a PITA
to switch, which is true but shouldn't be a deal breaker I don't
think.) is that we're going to all of a sudden have unclear and
obfuscated code. I'm arguing that this, while generally a valid
concern, is not going to happen just from switching endog -> y (or Y
for 2d?) and exog -> X. We have well commented, modular (mostly, when
I'm not knee deep in spaghetti) code. Changing the default variable
names to another _consistent_, _documented_ but more generally
accepted pattern isn't going to undo that.
I don't think this gains us much over endog/exog. It may solve the
jargon issue but still suffers from the ambiguity issue.
I'm not sure about this yet, but it's my suspicion that this is the
right time to do this. We're still a young code-base, but I think this
is moving us in the right direction of becoming more than a developer
This is a bit out of context and I believe was discussed for using one
letter names for, perhaps non-standard, tuning parameters. IIRC it was
spurred by the one letter tuning parameter in SVM being different than
the parameter of the same name in libsvm.
I think this is more in the spirit of what we should be striving for.
|Re: [pystatsmodels] exog / endog||Nathaniel Smith||7/23/12 7:33 AM|
On Mon, Jul 23, 2012 at 3:00 PM, Skipper Seabold <jsse...@gmail.com> wrote:
This is probably a good place to admit that technically this does not
describe my experience, because while I can never remember the
difference between endog and exog, I also have never succeeded in
figuring it out by googling. Mostly when writing emails to this list
I've just given up and used other terminology, or else just named one
thing "endog" and one thing "exog" at random and figured that if I got
them backwards then you'd still understand.
Wikipedia even has a nice article on what "endogenous" means in
economics, which seems like it would be helpful, but their usage is
completely inconsistent with statsmodels' usage...:
FWIW, I think it's misguided to object to "X" and "y" *simply* on the
grounds that single-letter variable names are bad. The important thing
is for variable names to be clear, and to not give in to the
temptation to use abbreviations that only make sense when writing the
code and not when reading it later. So single letter variable names
are only justified when they really are the best name, and are
sufficiently established that later readers will be able to recognize
the meaning from that single letter. The classic example is using "i"
as an index variable in a loop -- that's totally okay. I think one can
reasonably argue that X and y meet those criteria.
(Technically 'beta' and 'sigma' are also single-letter variable names.)
|Re: [pystatsmodels] exog / endog||Alan G Isaac||7/23/12 7:35 AM|
On 7/23/2012 10:00 AM, Skipper Seabold wrote:
Perhaps it would be better in such settings to just
say `endog` is our name for Y, and `exog` is our name
for X, if that is the notation they are used to?
Whatever the final choice, I hope it will fit well
in systems estimation frameworks, and in particular
I hope that it easily allows a clear distinction
between lagged endogs and other predetermined variables.
(Obviously I am not suggesting that choice of name
alone will have implications for such issues.)
|Re: [pystatsmodels] exog / endog||josefpktd||7/23/12 7:51 AM|
a quick check with Stata:
regress depvar [indepvars] [if] [in] [weight] [, options]
the gui spells out "dependent variable" and "independent variable"
Syntax: REG Procedure
The following statements are available in PROC REG:
PROC REG <options> ;
<label:>MODEL dependents=<regressors> </ options> ;
BY variables ;
examples use y and x as place holders
Matlab: What's W,X,Y0,W0 ?
Generate VARMAX model responses from innovations
[Y,logL] = vgxproc(Spec,W)
[Y,logL] = vgxproc(Spec,W,X,Y0,W0)
explanation: X Exogenous data
pictures of GUI
only matlab and R seem to favor x,y or X,y, or X,Y, and in R you
usually don't see them if you have to use formulas.
I like Stata.
|Re: [pystatsmodels] exog / endog||jseabold||7/23/12 8:05 AM|
Just to point out, VAR is an estimator (almost entirely) particular to
For the record, I'm not against depvar vs. indepvar, but I will point
out two things. 1) Stata (as an example since I'm more familiar with
programming estimators in it) doesn't use OO, so as soon as you get
down the inheritance chain (for us), you have things like
sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN)
[if] [in] [weight]
reg3 (depvar1 varlist1) (depvar2 varlist2) ...(depvarN varlistN) [if]
Y and X won't have the problem of having to switch nomenclature when
it's no longer theoretically correct I guess.
2) MATLAB and R are peculiar in that they're also programming
languages. Stata, SAS, and SPSS are primarily GUIs (Mata and IML
aside), so that you're never actually typing depvar = y. These are
mainly just used for documentation, which we can certainly include.
SAS I guess is the odd man out in that you are actually typing
dependents = ... I guess, but who wants to be more like SAS?
|Re: [pystatsmodels] exog / endog||josefpktd||7/23/12 9:11 AM|
we still need to be explicit for some cases in systemfit (or maybe
not, if the instruments are chosen correctly)
exog(varlist) exogenous variables not specified in
endog(varlist) additional right-hand-side endogenous variables
inst(varlist) full list of exogenous variables
I don't understand much in the R docs ?
the version on my computer is even shorter, and seems to be more recent
But like VAR that's again "econometrics"
Next time I try to find something statistics, but it's not so easy to
find something without E(X \epsilon) != 0
>SAS has very good documentation, and I wouldn't mind selling a license
|Re: [pystatsmodels] exog / endog||achompas||7/23/12 12:07 PM|
That's my suggestion from the Github Issues branch, and I think it's one of two "right ways" to resolve this problem (the other being "stick with exog & endog but identify them as statsmodels's analogues for X & y"). Here's why I see these as the "right solutions" for this problem.
First, let's identify the issue. What's the problem here? New users can't map function calls to their knowledge of statistical models. Let's say I want to use (or contribute to) statsmodels for the first time. Then, in IPython:
import statsmodels.api as sm
Then...wait, what do I do? Here's the relevant part of the docstring:
Generalized least squares model with a general covariance structure.
endog : array-like
endog is a 1-d vector that contains the response/independent variable
exog : array-like
exog is a n x p vector where n is the number of observations and p is
the number of regressors/dependent variables including the intercept
if one is included in the data.
First, this docstring is wrong since endog is the dependent variable while exog is the independent variable. That said, the corrected version might make sense to some veteran econometricians. For others, they won't know how to map this to their textbook understanding of GLMs. Then they'll hit Ctrl-D, enter 'R' at the bash prompt, and get back to work.
If we're thinking of statsmodels as an engineering project, what should its developers do? Take action that (a) doesn't require a lot of development time, yet (b) manages to thoroughly resolve the problem for the foreseeable future. I really really want X and y, but that will take a full refactor. What's the next best step? Keep endog and exog but update the docs so new devs/users can map them to variables they know.
pandas and statsmodels are two cool libraries. As more people start working with pandas and pandas.TimeSeries (like me with my current project), they'll need to analyze them with statistical or econometric tools. statsmodels is ripe for new users and more contributors, so let's pick a solution that maximizes adoption without requiring too much dev time.
|Re: [pystatsmodels] exog / endog||jseabold||7/23/12 12:13 PM|
On Mon, Jul 23, 2012 at 3:07 PM, achompas <bre...@gmail.com> wrote:
Well that's no good. Stop-gap:
|Re: [pystatsmodels] exog / endog||Christoph Deil||7/23/12 7:48 AM|
I'm an astronomer and I've used statsmodels a few times, and this is exactly how I feel about statsmodels.
+1 on changing exog / endog to something else, because statsmodels is the only place where I've ever heard these terms and every time I use statsmodels (which is only a few times per year) I have to re-learn them.
Wikipedia doesn't know anything about "exog" and "endog", using google I couldn't find anything useful quickly. Even in the statsmodels docs I couldn't find a good explanation.
-1 on any of the proposals to use exog / endog internally and x / y in the user interface or the other way around, code / papers where exactly the same thing is called by different names in different places are just unnecessarily confusing.
Here's my proposal for a rename:
xdata for what is called x at the moment.
ydata for what is called y or endog at the moment.
dmatrix or design_matrix for what is called X or exog at the moment.
This way there are no one-letter variable names and there is zero chance to confuse x and X.
I did not come up with these names myself, there is precedent. :-)
xdata and ydata is used in http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html and dmatrix is used in http://patsy.readthedocs.org/en/v0.1.0/formulas.html .
Independently of the choice made, I think it is great that Skipper, Vincent and others are trying to make the statsmodels docs more accessible.
At the moment it's really hard, because the statsmodels docs more or less start at http://statsmodels.sourceforge.net/devel/regression.html like this:
"Regression contains linear models with independently and identically distributed errors and for errors with heteroscedasticity or autocorrelation.
The statistical model is assumed to be Y = X * beta + mu, where mu ~ N(0, sigma ** 2 * Sigma), ..."
without explaining that the basic purpose of the code is to "compute the best-fit parameters beta given inputs X, Y and mu", that X is the design matrix, and how to construct the design matrix for common cases (polynomials, hyperplanes).
Just to be clear, these are the terms physicists understand: data (x, y), linear or nonlinear model, parameter, parameter error, fit.
Physicists have never heard of: exog, endog, not even design matrix.
(I just asked ~ 10 physicists / astronomers on my corridor, not a single one knew any of those three terms, even after I mentioned that they are used in conjunction with fitting linear models.)
I can see now that the term design matrix is very central to fitting linear models, so it should be in statsmodels, but I sure would like to see exog and endog go and design matrix better explained in the docs.
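For readers coming from the "data (x, y), model, fit" vocabulary, the link between raw data and a design matrix can be sketched with plain NumPy. The names `xdata`, `ydata` and `design_matrix` follow the renaming proposed above; the quadratic model and the numbers are made up for the illustration, and `numpy.linalg.lstsq` stands in for whatever estimator a library would provide.

```python
import numpy as np

# Raw data: ydata is exactly 1 + 2*x + 3*x**2, so the fit is easy to check.
xdata = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ydata = 1.0 + 2.0 * xdata + 3.0 * xdata**2

# Design matrix: one row per observation, one column per model term
# (constant, x, x squared).
design_matrix = np.column_stack([np.ones_like(xdata), xdata, xdata**2])

# Least-squares estimate of beta in ydata = design_matrix @ beta + error.
beta, _, _, _ = np.linalg.lstsq(design_matrix, ydata, rcond=None)
```

The model stays linear in beta even though it is a polynomial in x, which is why the columns of the design matrix, not the raw x values, are what a linear-model routine consumes.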
Here's a little code example using ROOT, the data analysis package most widely used by physicists, that shows an API that they would understand:
ROOT has python bindings ( http://root.cern.ch/drupal/content/how-use-use-python-pyroot-interpreter ) and a formula framework ( http://root.cern.ch/root/html/TF1.html ) that allows easily defining and fitting linear and non-linear models. I don't mean to say that ROOT and the python bindings are all gold, there are both small annoying and serious principal problems with using ROOT from python, and more and more physicists are using numpy / scipy, and hopefully soon also statsmodels if you guys manage to make it understandable for us.
Thanks for working on statsmodels and considering huge API changes with renaming the basic input parameters and integrating formulas.
I believe it would make statsmodels much more accessible to physicists (and I believe most other scientists and engineers, basically most data analysts outside econometrics / statistics?) and would be worth the trouble now in the long run.
|Re: [pystatsmodels] exog / endog||josefpktd||7/24/12 5:47 PM|
Thanks Christoph, pretty convincing arguments below, the science and
engineering tradition of numpy and scipy has partially slipped my
definitely a documentation failure
That's exactly what I was thinking of as a compromise, use x,y in the
signature and some more informative names in the models and result
instances. (more below)
xdata and ydata doesn't really look more informative than x, y
I never heard of a design_matrix before stats.models. That's an
unknown or (essentially) unused name in econometrics, since in
econometrics we seldom *design* our data.
R lm uses model and x for the dataframe and matrix of regressors. (I
don't know the details.) "model" is at least a weird name for a matrix
of explanatory variables.
I'm not sure in these cases whether it's the implementation that is
relevant or just the examples that use x and y.
I can see that exog/endog doesn't have much meaning outside of
econometrics, social and a few other sciences.
My problem is that I really don't like letter names. x, y, i are in my
opinion temporary variables. I try not to use i as a loop index unless
the loop is just a few lines or a list comprehension. In longer loops
I always worry I might have used `i` already and better not use it at
all. x and y are also generic names, I might have used them as temp
variable, which would accidentally overwrite the real ones. (xdata and
ydata sounds better again.)
As alternative to exogenous and endogenous variables, I think, the
only ones that are not a misnomer in some cases are dependent variable
and explanatory variables, independent variable is a nicer name but
means roughly the same as exog.
I never found a good short name for explanatory variable.
How deep do we want to change?
Given that Alan and I are a small minority, let's assume we switch to x and y.
Changing the signature of the models is easy OLS(y, x) RLM(y,x)
The question is what to do internally.
OLS, WLS, GLS, RLM, GLM and discrete: large parts of our current core
models are easy y=endog, x=exog, Then, there are wexog, wendog.
Do any users care what ols_results.models.wendog is called?
In tsa it gets a bit more complicated,
VARX has the regression matrix [past y, constant, trend, and real x]
(where x=exogenous variables and not yet implemented) (and the
regression matrix is shortened relative to the full data)
ARX, ARMAX similar to VARX, except it uses Kalman Filter and state
some discrete and GLM models allow for 2d y/exog that stacks some
GLSHet(endog, exog, exog_var=None, weights=None, link=None)
system of equations ...
from statsmodels.datasets.longley import load
data = load()
data.endog, data.exog - I know from the name that these are for
data.x - I have no idea whether this is the transformed, selected
data for the estimation example or the full dataset, or just some
examples: What do people use, when there is a specific dataset
>>> sorted([name for name in locals().keys() if name[:1] in ['e', 'y', 'x']])
['endog_aircraft', 'endog_sal', 'endog_sal0', 'endog_wood',
'exog_aircraft', 'exog_sal', 'exog_sal0', 'exog_wood']
|Re: [pystatsmodels] exog / endog||VincentAB||7/24/12 6:20 PM|
Now that `patsy` is there and that creating design matrices from raw data takes just a single line of code, I don't think that examples should ever use the exog and endog matrices stored in statsmodels data objects.
1) Doing that is no more compact or clear than calling patsy to create design matrices from scratch using the raw data
2) Using the data.exog/endog attributes forces users to learn something about the structure of the datasets objects. Most users shouldn't have to care about that since they are unlikely to use these data in actual work.
3) Going through patsy in examples is good practice and a nice teaching opportunity in terms of giving users the tools/knowledge needed to integrate statsmodels in their analysis workflow.
In my view, the exog and endog attributes of datasets should mostly be there for convenience in internal testing, and users should have minimal exposure to them. If this becomes the case^*, then the naming convention won't matter much.
FWIW, I think data.X is fine for data.exog (but capitalization is probably important to denote matrix form).
* I can probably help with tweaking examples if people agree with the above.
|Re: [pystatsmodels] exog / endog||jseabold||7/24/12 6:24 PM|
On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
Briefly. This was their original intention. I've mainly switched to
using load_pandas() for examples to be closer to what people are
I don't think we would need to rename these.
|Re: [pystatsmodels] exog / endog||josefpktd||7/25/12 12:36 AM|
On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
> re: examples
Next up some simple GUI widgets to select your y and x variables, then
a full GUI, and users won't realise that there is python inside
instead of R or ....
And we better go commercial before somebody else does.
|Re: [pystatsmodels] exog / endog||Alexandre Gramfort||7/25/12 2:06 AM|
if I may just share more from my experience with scikit-learn, we have the
convention to use capital letters for 2d or more arrays. So we use X and
slowly moving from y to Y in estimators that can work with multiple outputs.
Another convention we use is to add a trailing underscore to quantities
estimated from the data. For example, if beta is the regression
coefficients we use beta_ .
That's pretty convenient to inspect an estimator instance.
It would be great to have a consensus on this.
my 0.02 euros cents.
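The trailing-underscore convention described above can be shown with a toy class. `MeanModel` is a hypothetical estimator written only for this illustration; it is not a scikit-learn or statsmodels class.

```python
class MeanModel:
    """Hypothetical estimator used only to illustrate the naming convention."""

    def __init__(self, center=True):
        # No trailing underscore: a setting chosen by the user.
        self.center = center

    def fit(self, y):
        # Trailing underscore: a quantity estimated from the data.
        self.mean_ = sum(y) / len(y)
        return self


model = MeanModel().fit([1.0, 2.0, 6.0])

# Inspecting the instance: estimated quantities are the names ending in "_".
estimated = sorted(name for name in vars(model) if name.endswith("_"))
```

The convenience for introspection is that user-supplied settings and fitted quantities can be told apart by name alone.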
|Re: [pystatsmodels] exog / endog||VincentAB||7/25/12 5:42 AM|
The point is that the data set object structure doesn't do *anything* useful for the user in his actual work, so he shouldn't have to learn it. This is not an instance of syntactic sugar hiding other *useful* commands. I'm not advocating for all-automated-all-the-time. Many examples still create their own artificial data Xs and Ys anyway.
|Re: [pystatsmodels] exog / endog||jseabold||7/25/12 10:06 AM|
Ah, yeah. The crowd at scipy was mainly machine learning and stats,
finance, astronomy, engineering, science (bio, ecology). None of these
people knew endogenous / exogenous.
Off topic, and I brought this up before so it may not be the case
anymore, but IIRC, to confuse the issue patsy is also using dmatrix
for the LHS...
I think calling it whitened_x or whitened_y is even more informative.
I had no idea what these were when I first came to this code. The only
reference I could find was in some of Jonathan's lecture notes IIRC.
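For anyone meeting the "w" prefix for the first time, whitening in the weighted least squares case can be sketched as follows. The names `whitened_x` and `whitened_y` follow the suggestion above; the data and weights are made up, and the sketch assumes the simple case where whitening means scaling each observation by the square root of its weight.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.9, 5.2, 6.8])
weights = np.array([1.0, 0.5, 2.0, 1.0])

# Whitening for weighted least squares: scale each row by sqrt(weight).
root_w = np.sqrt(weights)
whitened_x = X * root_w[:, None]
whitened_y = y * root_w

# Ordinary least squares on the whitened data...
beta_whitened, *_ = np.linalg.lstsq(whitened_x, whitened_y, rcond=None)

# ...matches the weighted normal equations (X'WX) beta = X'Wy.
W = np.diag(weights)
beta_direct = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Both routes give the same coefficients, which is the point of whitening: after the transformation, the weighted problem can be handed to an ordinary least squares routine.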
ARMAX is already using exog internally to denote the whole RHS I believe.
I think these maybe could stay. These are mainly for internal/testing
use anyway, and I've mostly switched to *.load_pandas() when I'm doing
anything other than testing.
|Re: [pystatsmodels] exog / endog||Nathaniel Smith||7/26/12 4:23 AM|
On Mon, Jul 23, 2012 at 3:51 PM, <josef...@gmail.com> wrote:
I'm fine with "dependent" and "independent", but I think that's just
because it's the jargon I grew up with -- the actual meaning is not at
When talking to non-specialists I think I usually refer to the "y"
variable as the "outcome" and the "x" variables as "predictors" or