exog / endog | VincentAB | 7/22/12 5:54 PM | This thread is for discussion of statsmodels' exog/endog naming convention. This post summarizes this exchange: https://github.com/statsmodels/statsmodels/issues/395

Status quo
========
* Most models take 2 design matrices as arguments. Currently, they are named "exog" and "endog"

Problems
=======
* Feedback from multiple new users: the "endog" and "exog" names are a stumbling block for `statsmodels` adoption. These labels are often encountered in econometrics, but not in many other areas of applied statistics (even by users who have a strong stats background).
* In many contexts, the endog/exog distinction is ambiguous or a misnomer (e.g. models with endogenous regressors).

Option 1: Improving on the status quo
============================
* explanation for endog and exog should be more prominent (e.g. on the getting started page)

Benefits:
* Much less work!
* exog/endog is informative after you've learned what it means
* Moving away would be a big break with backwards compatibility and the current style
* "I thought of endog/exog as a nice distinguishing "feature" of statsmodels which also signals consistency across models."

Option 2: y, X , Z
=============
Benefits:
* Ubiquitous/Standard: used in lots of textbooks and by nearly all competitor software (e.g. R, Matlab, Stata)
* Immediately understandable by most; no longer a stumbling block for new users
* Consistent with related Python work (e.g. Pandas, PySal)

Problems:
* Non-pythonic
* Any introspection in a class that has one letter (or one greek letter) attributes is mostly useless.
* Non-expressive? "It always takes me an extra half a minute to figure out which is which, A, x, b for example." "Try to read the VAR code without the text book."

Response:
* "I don't think this is as drastically non-expressive as it seems on first glance. It's nearly universal. I'm not wild about one variable names either, it's non-pythonic, but I'm starting to come around to the viewpoint that endog and exog are an unnecessary technical/jargon hurdle."
* One-letter names in code are a completely different issue. "No one is arguing for using more one-letter variables all over the place, just x, y [and maybe z]."

Option 3: indep_var / dep_var
======================
Benefits:
* Not single letter
* More informative than X,y?

Problems:
* Sometimes ambiguous or misnomer.

Other stuff
========
* Is Formula integration a good or bad time to do this?
* IIRC, there was a brief message exchange that the sklearn developers are not so happy with the one letter names anymore.
* "How about moving to X & Y but describing them as exogenous & endogenous in the docs? That way the library keeps its econ heritage but moves to a generic naming convention." |

Re: [pystatsmodels] exog / endog | Matthew Brett | 7/22/12 7:35 PM | Hi,
Strongly in favor of Option2 - unambiguous and clear to a wide range of backgrounds (I believe). Compatibility with other software seems an obvious win. I can never remember which one is which of endog / exog. Dependent / independent is closer to my own tradition but not less inherently confusing, in my opinion. Any reason to prefer 'y' over 'Y'? Or the other way round? I guess Y can sometimes be a matrix? Thanks for bringing up the discussion. Best, Matthew |

Re: [pystatsmodels] exog / endog | josefpktd | 7/22/12 7:45 PM | simple: exog has an x in it
Josef |

Re: [pystatsmodels] exog / endog | josefpktd | 7/22/12 8:19 PM | Just as a reminder: We are trying to write a python library with full classes and a coverage much more than linear models, not just a single function that does OLS (or collection of functions).

compare introspection in Python with introspection in Matlab, Stata, and R (and names with Java or Csharp conventions(?))

as in any paper in econometrics, we will need
x
x_star
x_boldface
x_cursive
...
z_that_is_a_transformed_x
x_shortened
...

Josef |

Re: [pystatsmodels] exog / endog | josefpktd | 7/22/12 8:36 PM | On Sun, Jul 22, 2012 at 10:35 PM, Matthew Brett <matthe...@gmail.com> wrote:
endog is a 1d or 2d array depending on the model and usage (until now it's most of the time but not always 1d).
exog can be None, 1d, or 2d; in most cases it's 2d.
Josef |

Re: [pystatsmodels] exog / endog | Alexandre Gramfort | 7/22/12 11:52 PM | hi,
of course I am a bit biased by the scikit-learn naming convention but if I may share my experience, I have to look at the OLS example to map exog -> X and endog -> y every time I use statsmodels (but I may not use it enough :) ). exog and endog do not ring a bell for me. X and y in scikit-learn are not ideal but they're easy to remember, and those are the only variables that we tolerate as single letters. I think it's a good thing to avoid domain-specific jargon as much as possible in a project that has such widespread use. my 2c Alex |

Re: [pystatsmodels] exog / endog | Alan G Isaac | 7/23/12 5:28 AM | Just a reminder that we already hashed this all out long ago.
Alan Isaac |

Re: [pystatsmodels] exog / endog | josefpktd | 7/23/12 5:37 AM | On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
This is mostly a reminder for 3 econometricians. I'm still convinced that we took the right decision. But I think we didn't explain this well enough in the documentation. Josef |

Re: [pystatsmodels] exog / endog | Wes McKinney | 7/23/12 5:37 AM | On Mon, Jul 23, 2012 at 8:28 AM, Alan G Isaac <ais...@american.edu> wrote:
> Just a reminder that we already hashed this all out long ago.

Yes, but the project needs to be prepared to make some changes in the best interests of the project's future given the growth of the user base. The difference between 100 (then) and 10k (now) or 100k users is significant. Don't you want more users? Out of more users comes more developers, too. It seems to me like a small price to pay.

For my part, I've found that exog/endog is a problem when teaching statsmodels to a non-econometric crowd. Even after using the library for a long time, the names still cause me cognitive dissonance.

- Wes |

Re: [pystatsmodels] exog / endog | jseabold | 7/23/12 7:00 AM | Yes, to echo and add, I brought this up on github after the tutorial
at scipy last week (and I heard the same comments last year and the year before). I had to stop and spend a few minutes explaining endog and exog, was met with incredulous looks, and a few people left shortly thereafter. Many in the audience were sophisticated users with (applied) statistics or machine learning but not necessarily econometrics backgrounds and their aversion to this was palpable. We can try to rationalize away other users' difficulties, but the fact remains that this is a stumbling block for new users, especially in a "live" setting. Word of mouth travels, and I'd hope that "easy to use" would be the first word, not "it's ok but all the models take this endog and exog stuff, which I have to figure out every time." Skipper |

Re: [pystatsmodels] exog / endog | jseabold | 7/23/12 7:31 AM | On Sun, Jul 22, 2012 at 8:54 PM, VincentAB <vincen...@gmail.com> wrote:
Thanks for summarizing! This is very helpful.

This also assumes that users bring up our docs every time they use the library, though I will say that if we switched to having more documentation templating, every model could have a clear description of endog, exog that we write once and use everywhere.

I'm mildly in favor of biting the bullet and making this change. As I mention down-thread, this is mostly due to endog/exog not being _immediately_ clear without explanations, without bigger, more, and better docs or assuming that everyone has an econometrics background. I think Josef has shared the sentiment that users of a library shouldn't even have to look at the docs. It should be obvious from the call signature what the main arguments are for. We've been told that we don't have this right now. Everyone we've heard from that's not Josef, Alan, or I is in favor of changing this (though I'll wait for this thread to take its course). For users, old and new, (and new developers) this seems to be a strongly desired switch. I don't think this feedback should be ignored without considering what we gain by keeping endog and exog vs. switching.

The main concern from developers (other than that it's going to be a PITA to switch, which is true but shouldn't be a deal breaker, I don't think) is that we're going to all of a sudden have unclear and obfuscated code. I'm arguing that this, while generally a valid concern, is not going to happen just from switching endog -> y (or Y for 2d?) and exog -> X. We have well commented, modular (mostly, when I'm not knee deep in spaghetti) code. Changing the default variable names to another _consistent_, _documented_ but more generally accepted pattern isn't going to undo that.

I don't think this gains us much over endog/exog. It may solve the jargon issue but still suffers from the ambiguity issue.

I'm not sure about this yet, but it's my suspicion that this is the right time to do this. We're still a young code-base, but I think this is moving us in the right direction of becoming more than a developer library.

This is a bit out of context and I believe was discussed for using one letter names for, perhaps non-standard, tuning parameters. IIRC it was spurred by the one letter tuning parameter in SVM being different than the parameter of the same name in libsvm.

I think this is more in the spirit of what we should be striving for.

My $.02, Skipper |
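
The documentation-templating idea mentioned above (describe endog and exog once, reuse the text in every model) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `_SHARED_PARAMS`, `_MODEL_DOC`, `model_doc`, and `ExampleOLS` are hypothetical names, not statsmodels code.

```python
# Hypothetical sketch: none of these names come from statsmodels itself.

# Shared parameter descriptions, written once.
_SHARED_PARAMS = {
    "endog_doc": (
        "endog : array-like\n"
        "    The dependent (response) variable, often written y."
    ),
    "exog_doc": (
        "exog : array-like\n"
        "    The explanatory variables (regressors), often written X."
    ),
}

# Docstring skeleton that every model fills in.
_MODEL_DOC = """%(summary)s

Parameters
----------
%(endog_doc)s
%(exog_doc)s
"""


def model_doc(summary):
    """Fill the shared template with a model-specific summary line."""
    fields = dict(_SHARED_PARAMS, summary=summary)
    return _MODEL_DOC % fields


class ExampleOLS:
    # Assigning __doc__ in the class body sets the class docstring.
    __doc__ = model_doc("Ordinary least squares (illustrative stub).")

    def __init__(self, endog, exog):
        self.endog = endog
        self.exog = exog


class ExampleGLS:
    __doc__ = model_doc("Generalized least squares (illustrative stub).")

    def __init__(self, endog, exog):
        self.endog = endog
        self.exog = exog
```

With this layout a wording fix to the endog/exog description is made in one place and shows up in the help text of every model that uses the template.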

Re: [pystatsmodels] exog / endog | Nathaniel Smith | 7/23/12 7:33 AM | On Mon, Jul 23, 2012 at 3:00 PM, Skipper Seabold <jsse...@gmail.com> wrote:
This is probably a good place to admit that technically this does not describe my experience, because while I can never remember the difference between endog and exog, I also have never succeeded in figuring it out by googling. Mostly when writing emails to this list I've just given up and used other terminology, or else just named one thing "endog" and one thing "exog" at random and figured that if I got them backwards then you'd still understand. Wikipedia even has a nice article on what "endogenous" means in economics, which seems like it would be helpful, but their usage is completely inconsistent with statsmodels' usage...: https://en.wikipedia.org/wiki/Endogeneity_%28economics%29

FWIW, I think it's misguided to object to "X" and "y" *simply* on the grounds that single-letter variable names are bad. The important thing is for variable names to be clear, and to not give in to the temptation to use abbreviations that only make sense when writing the code and not when reading it later. So single letter variable names are only justified when they really are the best name, and are sufficiently established that later readers will be able to recognize the meaning from that single letter. The classic example is using "i" as an index variable in a loop -- that's totally okay. I think one can reasonably argue that X and y meet those criteria.

(Technically 'beta' and 'sigma' are also single-letter variable names.)

-n |

Re: [pystatsmodels] exog / endog | Alan G Isaac | 7/23/12 7:35 AM | On 7/23/2012 10:00 AM, Skipper Seabold wrote:
Perhaps it would be better in such settings to just say `endog` is our name for Y, and `exog` is our name for X, if that is the notation they are used to?

Whatever the final choice, I hope it will fit well in systems estimation frameworks, and in particular I hope that it easily allows a clear distinction between lagged endogs and other predetermined variables. (Obviously I am not suggesting that choice of name alone will have implications for such issues.)

Alan Isaac |

Re: [pystatsmodels] exog / endog | josefpktd | 7/23/12 7:51 AM | a quick check with Stata:

regress depvar [indepvars] [if] [in] [weight] [, options]

the gui spells out "dependent variable" and "independent variable"

SAS Syntax
Syntax: REG Procedure
The following statements are available in PROC REG:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect006.htm

PROC REG <options> ;
<label:>MODEL dependents=<regressors> </ options> ;
BY variables ;
...

examples use y and x as place holders

Matlab: What's W,X,Y0,W0 ?
http://www.mathworks.com/help/toolbox/econ/vgxproc.html
Generate VARMAX model responses from innovations
Synopsis
[Y,logL] = vgxproc(Spec,W)
[Y,logL] = vgxproc(Spec,W,X,Y0,W0)
explanation: X Exogenous data

SPSS: http://academic.udayton.edu/gregelvers/psy216/SPSS/reg.htm
pictures of GUI: Dependent, Independent(s)

only matlab and R seem to favor x,y or X,y, or X,Y, and in R you usually don't see them if you have to use formulas.

I like Stata.

Josef |

Re: [pystatsmodels] exog / endog | jseabold | 7/23/12 8:05 AM | Just to point out, VAR is an estimator (almost entirely) particular to
econometrics.

For the record, I'm not against depvar vs. indepvar, but I will point out two things.

1) Stata (as an example since I'm more familiar with programming estimators in it) doesn't use OO, so as soon as you get down the inheritance chain (for us), you have things like

sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN) [if] [in] [weight]
reg3 (depvar1 varlist1) (depvar2 varlist2) ...(depvarN varlistN) [if] [in] [weight]

Y and X won't have the problem of having to switch nomenclature when it's no longer theoretically correct I guess.

2) MATLAB and R are peculiar in that they're also programming languages. Stata, SAS, and SPSS are primarily GUIs (Mata and IML aside), so that you're never actually typing depvar = y. These are mainly just used for documentation, which we can certainly include. SAS I guess is the odd man out in that you are actually typing dependents = ... I guess, but who wants to be more like SAS?

Skipper |

Re: [pystatsmodels] exog / endog | josefpktd | 7/23/12 9:11 AM | we still need to be explicit for some cases in systemfit (or maybe
not, if the instruments are chosen correctly)

Model 2
exog(varlist) exogenous variables not specified in system equations
endog(varlist) additional right-hand-side endogenous variables
inst(varlist) full list of exogenous variables

http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_syslin_sect044.htm

I don't understand much in the R docs ?
http://rss.acs.unt.edu/Rdoc/library/systemfit/html/systemfit.html
the version on my computer is even shorter, and seems to be more recent

But like VAR that's again "econometrics"

Next time I try to find something statistics, but it's not so easy to find something without E(X \epsilon) != 0 (maybe GEE)

Josef

SAS has very good documentation, and I wouldn't mind selling a license for $100000 |

Re: [pystatsmodels] exog / endog | achompas | 7/23/12 12:07 PM |
That's my suggestion from the Github Issues branch, and I think it's one of two "right ways" to resolve this problem (the other being "stick with exog & endog but identify them as statsmodels's analogues for X & y"). Here's why I see these as the "right solutions" for this problem.

First, let's identify the issue. What's the problem here? New users can't map function calls to their knowledge of statistical models. Let's say I want to use (or contribute to) statsmodels for the first time. Then, in IPython:

    import statsmodels.api as sm
    sm.GLS?

Then...wait, what do I do? Here's the relevant part of the docstring:

    Generalized least squares model with a general covariance structure.

    Parameters
    ----------
    endog : array-like
        endog is a 1-d vector that contains the response/independent variable
    exog : array-like
        exog is a n x p vector where n is the number of observations and p is
        the number of regressors/dependent variables including the intercept
        if one is included in the data.

First, this docstring is wrong since endog is the dependent variable while exog is the independent variable. That said, the corrected version might make sense to some veteran econometricians. For others, they won't know how to map this to their textbook understanding of GLMs. Then they'll hit Ctrl-D, enter 'R' at the bash prompt, and get back to work.

If we're thinking of statsmodels as an engineering project, what should its developers do? Take action that (a) doesn't require a lot of development time, yet (b) manages to thoroughly resolve the problem for the foreseeable future. I really really want X and y, but that will take a full refactor. What's the next best step? Keep endog and exog but update the docs so new devs/users can map them to variables they know.

pandas and statsmodels are two cool libraries. As more people start working with pandas and pandas.TimeSeries (like me with my current project), they'll need to analyze them with statistical or econometric tools. statsmodels is ripe for new users and more contributors, so let's pick a solution that maximizes adoption without requiring too much dev time. |

Re: [pystatsmodels] exog / endog | jseabold | 7/23/12 12:13 PM | On Mon, Jul 23, 2012 at 3:07 PM, achompas <bre...@gmail.com> wrote:
Well that's no good. Stop-gap: https://github.com/statsmodels/statsmodels/commit/a515676737168568a9a834f7693199d55f8c7a4f
Skipper |

Re: [pystatsmodels] exog / endog | Christoph Deil | 7/23/12 7:48 AM | I'm an astronomer and I've used statsmodels a few times, and this is exactly how I feel about statsmodels.

+1 on changing exog / endog to something else, because statsmodels is the only place where I've ever heard these terms and every time I use statsmodels (which is only a few times per year) I have to re-learn them. Wikipedia doesn't know anything about "exog" and "endog", using google I couldn't find anything useful quickly. Even in the statsmodels docs I couldn't find a good explanation.

-1 on any of the proposals to use exog / endog internally and x / y in the user interface or the other way around, code / papers where exactly the same thing is called by different names in different places are just unnecessarily confusing.

Here's my proposal for a rename:
xdata for what is called x at the moment.
ydata for what is called y or endog at the moment.
dmatrix or design_matrix for what is called X or exog at the moment.

This way there are no one-letter variable names and there is zero chance to confuse x and X. I did not come up with these names myself, there is precedent. :-) xdata and ydata is used in http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html and dmatrix is used in http://patsy.readthedocs.org/en/v0.1.0/formulas.html .

Independently of the choice made, I think it is great that Skipper, Vincent and others are trying to make the statsmodels docs more accessible. At the moment it's really hard, because the statsmodels docs more or less start at http://statsmodels.sourceforge.net/devel/regression.html like this:
"Regression contains linear models with independently and identically distributed errors and for errors with heteroscedasticity or autocorrelation. The statistical model is assumed to be Y = X * beta + mu, where mu ~ N(0, sigma ** 2 * Sigma), ..."
without explaining that the basic purpose of the code is to "compute the best-fit parameters beta given inputs X, Y and mu", that X is the design matrix, and how to construct the design matrix for common cases (polynomials, hyperplanes).

Just to be clear, these are the terms physicists understand: data (x, y), linear or nonlinear model, parameter, parameter error, fit.
Physicists have never heard of: exog, endog, not even design matrix.
(I just asked ~ 10 physicists / astronomers on my corridor, not a single one knew any of those three terms, even after I mentioned that they are used in conjunction with fitting linear models.)
I can see now that the term design matrix is very central to fitting linear models, so it should be in statsmodels, but I sure would like to see exog and endog go and design matrix better explained in the docs.

Here's a little code example using ROOT, the data analysis package most widely used by physicists, that shows an API that they would understand: https://gist.github.com/3163783
ROOT has python bindings ( http://root.cern.ch/drupal/content/how-use-use-python-pyroot-interpreter ) and a formula framework ( http://root.cern.ch/root/html/TF1.html ) that allows easily defining and fitting linear and non-linear models. I don't mean to say that ROOT and the python bindings are all gold, there are both small annoying and serious principal problems with using ROOT from python, and more and more physicists are using numpy / scipy, and hopefully soon also statsmodels if you guys manage to make it understandable for us.

Thanks for working on statsmodels and considering huge API changes with renaming the basic input parameters and integrating formulas. I believe it would make statsmodels much more accessible to physicists (and I believe most other scientists and engineers, basically most data analysts outside econometrics / statistics?) and would be worth the trouble now in the long run.

Christoph |
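
To make the xdata / ydata / design_matrix vocabulary above concrete, here is a minimal pure-Python sketch of building a polynomial design matrix and fitting a straight line. The function names `polynomial_design_matrix` and `fit_line` are hypothetical, and the closed-form formulas only cover the single-regressor case; this is not how statsmodels computes its fits.

```python
def polynomial_design_matrix(xdata, degree):
    """Return rows [1, x, x**2, ..., x**degree], one row per observation.

    The leading column of ones is the intercept term.
    """
    return [[x ** power for power in range(degree + 1)] for x in xdata]


def fit_line(xdata, ydata):
    """Least-squares fit of ydata = intercept + slope * xdata.

    Uses the textbook closed form for one regressor plus an intercept.
    """
    n = len(xdata)
    mean_x = sum(xdata) / n
    mean_y = sum(ydata) / n
    sxx = sum((x - mean_x) ** 2 for x in xdata)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xdata, ydata))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Raw data as a physicist would describe it: points (x, y).
xdata = [0.0, 1.0, 2.0, 3.0]
ydata = [1.0, 3.0, 5.0, 7.0]

# Design matrix for a quadratic model: columns 1, x, x**2.
design_matrix = polynomial_design_matrix(xdata, degree=2)

# Best-fit parameters for the straight-line model.
intercept, slope = fit_line(xdata, ydata)
```

In this vocabulary, `ydata` plays the role of endog, while the design matrix built from `xdata` plays the role of exog.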

Re: [pystatsmodels] exog / endog | josefpktd | 7/24/12 5:47 PM | Thanks Christoph, pretty convincing arguments below, the science and engineering tradition of numpy and scipy has partially slipped my mind.

definitely a documentation failure

That's exactly what I was thinking of as a compromise, use x,y in the signature and some more informative names in the models and result instances. (more below)

xdata and ydata doesn't really look more informative than x, y

I never heard of a design_matrix before stats.models. That's an unknown or (essentially) unused name in econometrics, since in econometrics we seldom *design* our data.

R lm uses model and x for the dataframe and matrix of regressors. (I don't know the details.) "model" is at least a weird name for a matrix of explanatory variables.

I'm not sure in these cases whether it's the implementation that is relevant or just the examples that use x and y.

I can see that exog/endog doesn't have much meaning outside of econometrics, social and a few other sciences.

My problem is that I really don't like letter names. x, y, i are in my opinion temporary variables. I try not to use i as a loop index unless the loop is just a few lines or a list comprehension. In longer loops I always worry I might have used `i` already and better not use it at all. x and y are also generic names, I might have used them as temp variable, which would accidentally overwrite the real ones. (xdata and ydata sounds better again.)

As alternative to exogenous and endogenous variables, I think, the only ones that are not a misnomer in some cases are dependent variable and explanatory variables, independent variable is a nicer name but means roughly the same as exog.
I never found a good short name for explanatory variable.

How deep do we want to change?

Given that Alan and I are a small minority, let's assume we switch to x and y.

Changing the signature of the models is easy: OLS(y, x), RLM(y, x)

The question is what to do internally.

OLS, WLS, GLS, RLM, GLM and discrete: large parts of our current core models are easy, y=endog, x=exog. Then, there are wexog, wendog.

Do any users care what ols_results.model.wendog is called?

In tsa it gets a bit more complicated,
VARX has the regression matrix [past y, constant, trend, and real x] (where x=exogenous variables and not yet implemented) (and the regression matrix is shortened relative to the full data)
ARX, ARMAX similar to VARX, except it uses Kalman Filter and state space representation.

some discrete and GLM models allow for 2d y/exog that stacks some additional information.

...

multi-equation models
GLSHet(endog, exog, exog_var=None, weights=None, link=None)
system of equations ...
.....

datasets

    from statsmodels.datasets.longley import load
    data = load()

data.endog, data.exog - I know from the name that these are for the estimation

data.x - I have no idea whether this is the transformed, selected data for the estimation example or the full dataset, or just some intermediate data

documentation examples: What do people use, when there is a specific dataset x_longley ? x_grunfeld ?

my style:

    >>> sorted([name for name in locals().keys() if name[0] in ['e', 'y', 'x']])
    ['endog_aircraft', 'endog_sal', 'endog_sal0', 'endog_wood', 'exog_aircraft', 'exog_sal', 'exog_sal0', 'exog_wood']

Josef |
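
One possible shape for the "how deep do we change" question is to switch only the signature and keep the old names as aliases for a transition period. The sketch below is purely illustrative: `ExampleModel`, its `y`/`x` arguments, and the deprecation messages are assumptions, and nothing in the thread says statsmodels adopted this.

```python
import warnings


class ExampleModel:
    """Hypothetical model; the names and the shim are illustrative only."""

    def __init__(self, y=None, x=None, endog=None, exog=None):
        # Accept the old keyword names, but steer callers to the new ones.
        if endog is not None:
            warnings.warn("'endog' is deprecated, use 'y'",
                          DeprecationWarning, stacklevel=2)
            y = endog
        if exog is not None:
            warnings.warn("'exog' is deprecated, use 'x'",
                          DeprecationWarning, stacklevel=2)
            x = exog
        if y is None:
            raise TypeError("the dependent variable 'y' is required")
        self.y = y
        self.x = x

    # Read-only aliases so existing code that inspects model.endog or
    # model.exog keeps working.
    @property
    def endog(self):
        return self.y

    @property
    def exog(self):
        return self.x
```

Positional calls such as `ExampleModel(data_y, data_x)` are unaffected, which is why changing the signature is the easy part; the aliases address code that reads the attributes afterwards.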

Re: [pystatsmodels] exog / endog | VincentAB | 7/24/12 6:20 PM | re: examples

Now that `patsy` is there and that creating design matrices from raw data takes just a single line of code, I don't think that examples should ever use the exog and endog matrices stored in statsmodels data objects.

1) Doing that is no more compact or clear than calling patsy to create design matrices from scratch using the raw data
2) Using the data.exog/endog attributes forces users to learn something about the structure of the datasets objects. Most users shouldn't have to care about that since they are unlikely to use these data in actual work.
3) Going through patsy in examples is good practice and a nice teaching opportunity in terms of giving users the tools/knowledge needed to integrate statsmodels in their analysis workflow.

In my view, the exog and endog attributes of datasets should mostly be there for convenience in internal testing, and users should have minimal exposure to them. If this becomes the case^*, then the naming convention won't matter much. FWIW, I think data.X is fine for data.exog (but capitalization is probably important to denote matrix form).

Vincent

* I can probably help with tweaking examples if people agree with the above. |

Re: [pystatsmodels] exog / endog | jseabold | 7/24/12 6:24 PM | On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
Briefly. This was their original intention. I've mainly switched to using load_pandas() for examples to be closer to what people are really doing. I don't think we would need to rename these.
Skipper |

Re: [pystatsmodels] exog / endog | josefpktd | 7/25/12 12:36 AM | On Tue, Jul 24, 2012 at 9:20 PM, VincentAB <vincen...@gmail.com> wrote:
> re: examples

Next up some simple GUI widgets to select your y and x variables, then a full GUI, and users won't realise that there is python inside instead of R or .... And we better go commercial before somebody else does.

Josef |

Re: [pystatsmodels] exog / endog | Alexandre Gramfort | 7/25/12 2:06 AM | hi,
if I may just share more from my experience with scikit-learn, we have the convention to use capital letters for 2d or more arrays. So we use X and are slowly moving from y to Y in estimators that can work with multiple outputs. Another convention we use is to add a trailing underscore to quantities estimated from the data. For example, if beta is the regression coefficients we use beta_ . That's pretty convenient to inspect an estimator instance. It would be great to have a consensus on this. my 0.02 euro cents. Alex |
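
The trailing-underscore convention described above can be shown with a toy estimator. This is a sketch under stated assumptions: `SlopeThroughOrigin` and `fitted_attributes` are hypothetical names, the model is a one-regressor fit without intercept, and none of it is scikit-learn or statsmodels code.

```python
class SlopeThroughOrigin:
    """Toy estimator for y = beta * x (one regressor, no intercept)."""

    def __init__(self, name="demo"):
        # Constructor arguments are stored without a trailing underscore.
        self.name = name

    def fit(self, X, y):
        # X is 2-D (rows of observations, one column); y is 1-D.
        column = [row[0] for row in X]
        numerator = sum(x * v for x, v in zip(column, y))
        denominator = sum(x * x for x in column)
        # Quantities estimated from the data get a trailing underscore.
        self.beta_ = numerator / denominator
        self.n_obs_ = len(y)
        return self


def fitted_attributes(estimator):
    """List the public attributes that were estimated from data."""
    return sorted(
        name for name in vars(estimator)
        if name.endswith("_") and not name.startswith("_")
    )


model = SlopeThroughOrigin()
before = fitted_attributes(model)
model.fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
after = fitted_attributes(model)
```

Because fitted quantities are the only public names ending in an underscore, inspecting an instance makes it easy to tell what was set by the user and what was learned from the data.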

Re: [pystatsmodels] exog / endog | VincentAB | 7/25/12 5:42 AM | Hehe! The point is that the data set object structure doesn't do *anything* useful for the user in his actual work, so he shouldn't have to learn it. This is not an instance of syntactic sugar hiding other *useful* commands. I'm not advocating for all-automated-all-the-time. Many examples still create their own artificial data Xs and Ys anyway. Vincent |

Re: [pystatsmodels] exog / endog | jseabold | 7/25/12 10:06 AM | On Tue, Jul 24, 2012 at 8:47 PM, <josef...@gmail.com> wrote:
> Thanks Christoph, pretty convincing arguments below, the science and engineering tradition of numpy and scipy has partially slipped my mind.

Ah, yeah. The crowd at scipy was mainly machine learning and stats, finance, astronomy, engineering, science (bio, ecology). None of these people knew endogenous / exogenous.

>> +1 on changing exog / endog to something else, because statsmodels is the only place where I've ever heard these terms and every time I use statsmodels (which is only a few times per year) I have to re-learn them. Wikipedia doesn't know anything about "exog" and "endog", using google I couldn't find anything useful quickly. Even in the statsmodels docs I couldn't find a good explanation.
>
> definitely a documentation failure
>
>> -1 on any of the proposals to use exog / endog internally and x / y in the user interface or the other way around, code / papers where exactly the same thing is called by different names in different places are just unnecessarily confusing.
>
> That's exactly what I was thinking of as a compromise, use x,y in the signature and some more informative names in the models and result instances. (more below)
>
>> Here's my proposal for a rename:
>> xdata for what is called x at the moment.
>> ydata for what is called y or endog at the moment.
>
> xdata and ydata doesn't really look more informative than x, y
>
>> dmatrix or design_matrix for what is called X or exog at the moment.
>
> I never heard of a design_matrix before stats.models. That's an unknown or (essentially) unused name in econometrics, since in econometrics we seldom *design* or data.
>
> R lm uses model and x for the dataframe and matrix of regressors. (I don't know the details.) "model" is at least a weird name for a matrix of explanatory variables.
>
>> This way there are no one-letter variable names and there is zero chance to confuse x and X.
>> I did not come up with these names myself, there is precedence. :-)
>> xdata and ydata is used in http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
>> and dmatrix is used in http://patsy.readthedocs.org/en/v0.1.0/formulas.html .

Off topic, and I brought this up before so it may not be the case anymore, but IIRC, to confuse the issue patsy is also using dmatrix for the LHS...

>> Independently of the choice made, I think it is great that Skpper, Vincent and others are trying to make the statsmodels docs more accessible.
>> At the moment it's really hard, because the statsmodels docs more or less start at http://statsmodels.sourceforge.net/devel/regression.html like this:
>> "Regression contains linear models with independently and identically distributed errors and for errors with heteroscedasticity or autocorrelation.
>> The statistical model is assumed to be Y = X * beta + mu, where mu ~ N(0, sigma ** 2 * Sigma), ..."
>> without explaining that the goal basic purpose of the code is to "compute the best-fit parameters beta given inputs X, Y and mu and that X is the design matrix and how to construct the design matrix for common cases (polynomials, hyperplanes).
>>
>> Just to be clear, these are the terms physicists understand: data (x, y), linear or nonlinear model, parameter, parameter error, fit.
>> Physicists have never heard of: exog, endog, not even design matrix.
>> (I just asked ~ 10 physicists / astronomers on my corridor, not a single one knew any of those three terms, even after I mentioned that they are used in conjunction with fitting linear models.)
>> I can see now that the term design matrix is very central to fitting linear models, so it should be in statsmodels, but I sure would like to see exog and endog go and design matrix better explained in the docs.
>> >> Here's a little code example using ROOT, the data analysis package most >> widely used by physicists, that shows an API that they would understand: >> https://gist.github.com/3163783 >> ROOT has python bindings ( >> http://root.cern.ch/drupal/content/how-use-use-python-pyroot-interpreter >> ) and a formula framework ( http://root.cern.ch/root/html/TF1.html ) that >> allows easily defining and fitting linear and non-linear models. I don't >> mean to say that ROOT and the python bindings are all gold, there are both >> small annoying and serious principal problems with using ROOT from python, >> and more and more physicists are using numpy / scipy, and hopefully soon >> also statsmodels if you guys manage to make it understandable for us. > > I'm not sure in these cases whether it's the implementation that is > relevant or just the examples that use x and y. > >> >> Thanks for working on statsmodels and considering huge API changes with >> renaming the basic input parameters and integrating formulas. >> I believe it would make statsmodels much more accessible to physicists (and >> I believe most other scientists and engineers, basically most data analysts >> outside econometrics / statistics?) and would be worth the trouble now in >> the long run. >> >> Christoph >> > > I can see that exog/endog doesn't have much meaning outside of > econometrics, social and a few other sciences. > > My problem is that I really don't like letter names. x, y, i are in my > opinion temporary variables. I try not to use i as a loop index unless > the loop is just a few lines or a list comprehension. In longer loops > I always worry I might have used `i` already and better not use it at > all. x and y are also generic names, I might have used them as temp > variable, which would accidentally overwrite the real ones. (xdata and > ydata sounds better again.) 
> > As an alternative to exogenous and endogenous variables, I think, the > only ones that are not a misnomer in some cases are dependent variable > and explanatory variables, independent variable is a nicer name but > means roughly the same as exog. > I never found a good short name for explanatory variable. > > > How deep do we want to change? > > Given that Alan and I are a small minority, let's assume we switch to x and y. > > Changing the signature of the models is easy OLS(y, x) RLM(y,x) > > The question is what to do internally. > > OLS, WLS, GLS, RLM, GLM and discrete: large parts of our current core > models are easy y=endog, x=exog, Then, there are wexog, wendog. > > Do any users care what ols_results.models.wendog is called? I think calling it whitened_x or whitened_y is even more informative. I had no idea what these were when I first came to this code. The only reference I could find was in some of Jonathan's lecture notes IIRC. > > In tsa it gets a bit more complicated, > VARX has the regression matrix [past y, constant, trend, and real x] > (where x=exogenous variables and not yet implemented) (and the > regression matrix is shortened relative to the full data) > ARX, ARMAX similar to VARX, except it uses Kalman Filter and state > space representation. > ARMAX is already using exog internally to denote the whole RHS I believe. > some discrete and GLM models allow for 2d y/exog that stacks some > additional information. > > ... > > multi-equation models > GLSHet(endog, exog, exog_var=None, weights=None, link=None) > system of equations ... > ..... > > > datasets > > from statsmodels.datasets.longley import load > data = load() > data.endog, data.exog - I know from the name that these are for > the estimation > > data.x - I have no idea whether this is the transformed, selected > data for the estimation example or the full dataset, or just some > intermediate data > I think these maybe could stay. 
These are mainly for internal/testing use anyway, and I've mostly switched to *.load_pandas() when I'm doing anything other than testing. > > documentation > > examples: What do people use, when there is a specific dataset > > x_longley ? > x_grunfeld ? > > my style: >>>> sorted([name for name in locals().keys() if name[0] in ['e', 'y', 'x']]) > ['endog_aircraft', 'endog_sal', 'endog_sal0', 'endog_wood', > 'exog_aircraft', 'exog_sal', 'exog_sal0', 'exog_wood'] > > Josef |
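
For readers unfamiliar with the terms in the exchange above, here is a minimal NumPy sketch of what a design matrix and the "whitened" arrays look like for a quadratic fit. It is an illustration under stated assumptions, not statsmodels code: the names `xdata`, `ydata`, `design_matrix`, `whitened_x` and `whitened_y` are borrowed from the proposals in this thread, the data and weights are made up, and "whitening" is assumed to mean scaling each row by the square root of its weight, as in weighted least squares.

```python
import numpy as np

# Made-up, noise-free data: y = 1 + 2*x - 0.5*x**2
xdata = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ydata = 1.0 + 2.0 * xdata - 0.5 * xdata**2

# Design matrix (what the thread calls "exog" or "X"):
# one column per model term, here 1, x and x**2.
design_matrix = np.column_stack([np.ones_like(xdata), xdata, xdata**2])

# Ordinary least squares: beta minimizing ||ydata - design_matrix @ beta||.
beta, *_ = np.linalg.lstsq(design_matrix, ydata, rcond=None)

# Weighted least squares via whitening (assumption: row scaling by
# sqrt(weight)); these play the role of "wexog" / "wendog" in the thread.
weights = np.array([1.0, 2.0, 1.0, 0.5, 4.0])  # illustrative weights
sqrt_w = np.sqrt(weights)
whitened_x = sqrt_w[:, None] * design_matrix
whitened_y = sqrt_w * ydata
beta_wls, *_ = np.linalg.lstsq(whitened_x, whitened_y, rcond=None)
```

Because the data are noise-free, both fits should recover the same coefficients; with noisy data the weighted fit would generally differ from the unweighted one.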

Re: [pystatsmodels] exog / endog | Nathaniel Smith | 7/26/12 4:23 AM | On Mon, Jul 23, 2012 at 3:51 PM, <josef...@gmail.com> wrote:
> a quick check with Stata: > > regress depvar [indepvars] [if] [in] [weight] [, options] > > the gui spells out "dependent variable" and "independent variable" I'm fine with "dependent" and "independent", but I think that's just because it's the jargon I grew up with -- the actual meaning is not at all transparent. When talking to non-specialists I think I usually refer to the "y" variable as the "outcome" and the "x" variables as "predictors" or "regressors". -n |
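
Whichever names win, the backwards-compatibility concern raised earlier in the thread could in principle be softened by accepting both spellings for a while. The sketch below shows one way that might look. `resolve_data_args` is a hypothetical helper invented for this illustration; it is not part of statsmodels, and the choice of `y` / `x` as the new names is only one of the options under discussion.

```python
import warnings


def resolve_data_args(y=None, x=None, endog=None, exog=None):
    """Hypothetical helper (not statsmodels API): map old names to new ones."""
    if endog is not None:
        if y is not None:
            raise TypeError("pass either 'y' or 'endog', not both")
        warnings.warn("'endog' is deprecated, use 'y'",
                      DeprecationWarning, stacklevel=2)
        y = endog
    if exog is not None:
        if x is not None:
            raise TypeError("pass either 'x' or 'exog', not both")
        warnings.warn("'exog' is deprecated, use 'x'",
                      DeprecationWarning, stacklevel=2)
        x = exog
    if y is None or x is None:
        raise TypeError("both the outcome (y) and the regressors (x) are required")
    return y, x


# New-style call: no warning is emitted.
y, x = resolve_data_args(y=[1.0, 2.0], x=[[1.0], [1.0]])

# Old-style call still works but emits one DeprecationWarning per old name.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y_old, x_old = resolve_data_args(endog=[1.0, 2.0], exog=[[1.0], [1.0]])
```

A model constructor could call such a helper first and use only the new names internally, which would keep old user code running during a deprecation period.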