Thanks Christoph, pretty convincing arguments below. The science and
engineering tradition of numpy and scipy had partially slipped my
mind.
>
> +1 on changing exog / endog to something else, because statsmodels is the
> only place where I've ever heard these terms and every time I use
> statsmodels (which is only a few times per year) I have to re-learn them.
> Wikipedia doesn't know anything about "exog" and "endog", using google I
> couldn't find anything useful quickly. Even in the statsmodels docs I
> couldn't find a good explanation.
definitely a documentation failure
>
> -1 on any of the proposals to use exog / endog internally and x / y in the
> user interface or the other way around, code / papers where exactly the same
> thing is called by different names in different places are just
> unnecessarily confusing.
That's exactly what I was thinking of as a compromise: use x, y in the
signature and some more informative names in the models and result
instances. (more below)
>
> Here's my proposal for a rename:
> xdata for what is called x at the moment.
> ydata for what is called y or endog at the moment.
xdata and ydata don't really look more informative than x, y
> dmatrix or design_matrix for what is called X or exog at the moment.
I never heard of a design_matrix before stats.models. That's an
unknown or (essentially) unused name in econometrics, since in
econometrics we seldom *design* our data.
R lm uses model and x for the dataframe and matrix of regressors. (I
don't know the details.) "model" is at least a weird name for a matrix
of explanatory variables.
I'm not sure in these cases whether it's the implementation that is
relevant or just the examples that use x and y.
>
> Thanks for working on statsmodels and considering huge API changes with
> renaming the basic input parameters and integrating formulas.
> I believe it would make statsmodels much more accessible to physicists (and
> I believe most other scientists and engineers, basically most data analysts
> outside econometrics / statistics?) and would be worth the trouble now in
> the long run.
>
> Christoph
>
I can see that exog/endog doesn't have much meaning outside of
econometrics, the social sciences and a few other sciences.
My problem is that I really don't like letter names. x, y, i are in my
opinion temporary variables. I try not to use i as a loop index unless
the loop is just a few lines or a list comprehension. In longer loops
I always worry that I might have used `i` already, so I'd rather not
use it at all. x and y are also generic names. I might have used them
as temp variables, which would accidentally overwrite the real ones.
(xdata and ydata sound better again.)
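A small sketch of the kind of accident I have in mind. The helper
name and the data are made up for illustration, this is not code from
statsmodels:

```python
def column_means(x):
    """Hypothetical helper; `x` is meant to be the regressor data (list of rows)."""
    means = []
    for col in zip(*x):
        x = sum(col) / len(col)  # temp value accidentally named `x`
        means.append(x)
    # From here on `x` is the last column mean, not the data passed in.
    return means, x


exog_example = [[1.0, 10.0], [3.0, 30.0]]
means, leftover = column_means(exog_example)
```

With a longer name like exog for the data, the temporary `x` would not
have clobbered anything.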
As alternatives to exogenous and endogenous variables, I think the
only names that are not a misnomer in some cases are dependent variable
and explanatory variables. Independent variable is a nicer name but
means roughly the same as exog.
I never found a good short name for explanatory variable.
How deep do we want to change?
Given that Alan and I are a small minority, let's assume we switch to x and y.
Changing the signature of the models is easy: OLS(y, x), RLM(y, x)
The question is what to do internally.
OLS, WLS, GLS, RLM, GLM and discrete: large parts of our current core
models are easy, y=endog, x=exog. Then there are wexog, wendog.
Do any users care what ols_results.model.wendog is called?
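Roughly what the compromise could look like: x, y in the signature,
endog/exog kept as the stored names. This is only a sketch with a
made-up class (SketchModel is hypothetical, and this is not how the
statsmodels classes are implemented):

```python
class SketchModel:
    """Hypothetical stand-in for a model class, for illustration only."""

    def __init__(self, y, x):
        # Short names in the signature, descriptive names stored internally.
        self.endog = list(y)
        self.exog = [list(row) for row in x]

    # Read-only aliases so that model.y / model.x point at the same data.
    @property
    def y(self):
        return self.endog

    @property
    def x(self):
        return self.exog


model = SketchModel(y=[1.0, 2.0, 3.0],
                    x=[[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
```

Here model.y and model.endog are the same object, so the two names
cannot drift apart, but the same thing still has two names, which is
what Christoph objects to.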
In tsa it gets a bit more complicated,
VARX has the regression matrix [past y, constant, trend, and real x]
(where x=exogenous variables and not yet implemented) (and the
regression matrix is shortened relative to the full data)
ARX, ARMAX: similar to VARX, except that they use the Kalman filter
and a state space representation.
some discrete and GLM models allow for 2d y/exog that stacks some
additional information.
...
multi-equation models
GLSHet(endog, exog, exog_var=None, weights=None, link=None)
system of equations ...
.....
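To make the tsa case above concrete, here is a sketch of a regression
matrix of the form [past y, constant, trend], for a single series only
(so an AR rather than a VAR, and without the real x). The function name
is made up, and this is not the tsa code:

```python
def lagged_regression_matrix(y, nlags):
    """Build rows [y[t-1], ..., y[t-nlags], constant, trend].

    The first `nlags` observations are lost, so the result is shorter
    than the full data.
    """
    targets = []
    rows = []
    for t in range(nlags, len(y)):
        past = [y[t - lag] for lag in range(1, nlags + 1)]
        rows.append(past + [1.0, float(t)])  # constant, then linear trend
        targets.append(y[t])
    return targets, rows


y_full = [1.0, 2.0, 4.0, 8.0, 16.0]
y_used, regressors = lagged_regression_matrix(y_full, nlags=2)
```

With 5 observations and 2 lags only 3 rows are left, and the columns
are past values of y itself, so neither "x" nor "exog" is an obviously
good name for this matrix.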
datasets
from statsmodels.datasets.longley import load
data = load()
data.endog, data.exog - I know from the name that these are for
the estimation
data.x - I have no idea whether this is the transformed, selected
data for the estimation example or the full dataset, or just some
intermediate data
documentation
examples: What do people use, when there is a specific dataset
x_longley ?
x_grunfeld ?
my style:
>>> sorted([name for name in locals().keys() if name[0] in ['e', 'y', 'x']])
['endog_aircraft', 'endog_sal', 'endog_sal0', 'endog_wood',
'exog_aircraft', 'exog_sal', 'exog_sal0', 'exog_wood']
Josef