import numpy as np
import statsmodels.api as sm  # New users ask: what is "api"? Not pythonic; one more reason not to use the library.
import statsmodels.formula.api as smf  # A separate module if I want to use formulas!?

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()  # lower-case ols!
# Wished-for alternative (not the current API):
import numpy as np
import statsmodels as sm

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = sm.OLS('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
# And I wonder if there shouldn't be some workaround for needing to import and
# reference np? Who wants "np.log" in their summary output?

with the proposed signature

OLS(endog_or_formula, exog=None, data=None, ...)
Part 1: Why api and not just import everything in statsmodels.__init__?

Why do I have to import pandas, matplotlib and two thirds of scipy when I just want to run a t-test?
Part 2: Zen rule 2, "Explicit is better than implicit."

I don't like the complications and convoluted code that we get from second-guessing users (e.g. pandas' early indexing: is it row or column, or a bit of both, or neither?), and the ambiguity of having arguments that have different meanings. (Although I am starting to get a little bit less pessimistic about multiple dispatch in general. (*))
You can always do `OLS.from_formula(formula, data, ...)`
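For reference, a runnable version of that route, reusing dat from the opening example (it should be equivalent to the smf.ols call above):

import numpy as np
import statsmodels.api as sm

# from_formula is the classmethod behind the formula interface
results = sm.OLS.from_formula('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
print(results.summary())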
IMO, saving the typing of a few extra characters, or the use of auto-completion, is not worth the extra ambiguity and complexity.
> I wonder if there shouldn't be some workaround for needing to import and reference np?

from numpy import log

results = smf.ols('Lottery ~ Literacy + log(Pop1831)', data=dat).fit()

patsy evaluates in the environment, so what the function is called doesn't make a difference, e.g.

from numpy import log as Ln

and then use Ln in the formulas.
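Put together, a runnable sketch of the aliasing trick (reusing dat and smf from above):

from numpy import log as Ln

# patsy resolves Ln in the calling environment, so the fitted model
# reports the term as Ln(Pop1831) rather than np.log(Pop1831)
results = smf.ols('Lottery ~ Literacy + Ln(Pop1831)', data=dat).fit()
print(results.params)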
Once upon a time the standard was

from numpy import *
from pylab import *

I never liked this. I never liked that it was impossible (or very difficult) to use namespaces in Matlab. When I read Julia code every once in a while, I have no idea where a considerable fraction of the names is coming from.
When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get the new user there should be a priority.
> Part 1: Why api and not just import everything in statsmodels.__init__?
> Why do I have to import pandas, matplotlib and two thirds of scipy when I just want to run a t-test?

I'm not convinced. This seems like sacrificing clarity for an implementation detail. There are other approaches, like lazy imports, that would conserve memory without giving up clarity (although this should be something that's implemented Python-wide). As soon as you try to touch anything useful you're going to be importing the bulk of those libs anyway, as you noted. It took me a long time to appreciate what statsmodels.api was actually doing compared to just importing statsmodels, and I was confused for a long time about where stuff that was defined in the api was supposed to be in the documentation. Most people, especially new users, aren't going to look at your source code to figure it out.
just replying to two general points

On Fri, Oct 27, 2017 at 3:05 PM, Damien Moore <damien...@gmail.com> wrote:
> When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get the new user there should be a priority.

I'm not coming from R, and in most cases I don't use formulas.
And for applications where speed matters it is better to avoid formulas, both in statsmodels and in R.
Formulas are very convenient for handling interaction effects, and statsmodels can provide some features only when it has the formula information. But there are a large number of use cases where we just want to run simple estimation problems on hundreds or thousands of cases. Running formulas each time is just a waste.
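A minimal sketch of the formula-free path described here, reusing dat from the opening example (the names y and X are mine):

import numpy as np
import statsmodels.api as sm

# Build the design matrix once with numpy instead of re-parsing a formula
y = dat['Lottery']
X = sm.add_constant(np.column_stack([dat['Literacy'], np.log(dat['Pop1831'])]))
results = sm.OLS(y, X).fit()
print(results.params)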
The general point is that a newbie is a newbie for a month or two, but uses it (hopefully) for years. So I'm not sacrificing code/API clarity for convenience when there is a large enough cost of doing so. (Same idea as "code is read more often than written.")
Six to eight years ago there were several packages that tried lazy import. As far as I know, they all gave up. It didn't work well and was much too complicated.
Even if we are not as successful as before, it is still faster to import parts than everything (i.e., statsmodels.api). This is especially important for library use, and not so much for interactive work where everything is loaded once and then used for a long time. (The main problems are scipy.stats and pandas, but I'm not giving up yet.)
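Concretely, "importing parts" means pulling an estimator from its defining module rather than going through statsmodels.api, e.g. (module path as in current statsmodels):

# Library-style targeted import: only the OLS machinery gets loaded,
# not the whole statsmodels.api surface.
from statsmodels.regression.linear_model import OLS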
In comparison to other packages: I think the programs "written by statisticians for statisticians" hide too much behind the scenes, and I always had problems discovering what is hiding in the formulas.
What I like about a commercial package that is popular among applied economists and some other fields is that everything follows the same pattern and is well integrated. This is one of my objectives for statsmodels, and currently there is a single pattern for all models that users have to learn only once, or maybe a few patterns given the different underlying structure in time series analysis.
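For illustration, a minimal sketch of that shared pattern on synthetic data (the variable names are mine, not from the thread): every model is constructed from data, estimated with .fit(), and inspected through the returned results object.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ [1.0, 2.0, -1.0] + rng.normal(size=100)
y_bin = (y > y.mean()).astype(int)

# Same construct -> fit -> results pattern across very different models
print(sm.OLS(y, X).fit().params)              # linear regression
print(sm.Logit(y_bin, X).fit(disp=0).params)  # binary logit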
Note that in 0.21.0 pandas will now lazily import pytest and matplotlib, so import times should be down a bit.
Picking this back up...
On Friday, October 27, 2017 at 3:57:50 PM UTC-4, josefpktd wrote:
> just replying to two general points
>
> On Fri, Oct 27, 2017 at 3:05 PM, Damien Moore <damien...@gmail.com> wrote:
>> When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get the new user there should be a priority.
>
> I'm not coming from R, and in most cases I don't use formulas.

I don't come from R either, but in lots of cases, particularly those that a novice user is interested in (or really for anyone doing exploratory data analysis), the formula interface is better. It creates a graceful way to create a whole host of calculated variables without blowing up the original dataset.

> And for applications where speed matters it is better to avoid formulas, both in statsmodels and in R.
>
> Formulas are very convenient for handling interaction effects, and statsmodels can provide some features only when it has the formula information. But there are a large number of use cases where we just want to run simple estimation problems on hundreds or thousands of cases. Running formulas each time is just a waste.

Again, I'm not talking about making it impossible to do it the way you like (and which I like too in some cases), just changing the default to make the more beginner-friendly thing more accessible from the outset. As a fellow economist, I thought you would appreciate the potency of good defaults. :) The use cases you are focused on come up more frequently in an academic context than in applied work.

> The general point is that a newbie is a newbie for a month or two, but uses it (hopefully) for years. So I'm not sacrificing code/API clarity for convenience when there is a large enough cost of doing so. (Same idea as "code is read more often than written.")

I'm not seeing the cost in clarity. As I said above, there's a huge clarity cost that comes with the namespace rearrangement when importing statsmodels.api, which every new user is told to do, and with that imported namespace no longer lining up with the reference docs.

> Six to eight years ago there were several packages that tried lazy import. As far as I know, they all gave up. It didn't work well and was much too complicated.

For something that uses a lot of C or Cython or needs really fine-grained memory management I can see that could be a problem, but it's a pretty simple monkey patch to override the importer to create a proxy object that defers the actual import until objects in it are referenced. It's getting more fine-grained than this that's challenging, but for statsmodels I doubt that's necessary. The main thing would be to make sure nothing in the API actually tries to touch the internals of scipy/pandas/other big libraries on import.
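To make the proxy idea concrete, here is a minimal sketch of the deferred-import approach described above (the class name LazyModule is mine, not an existing statsmodels or stdlib API):

import importlib

class LazyModule:
    """Defer importing a heavy module until an attribute is first used."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the proxy itself,
        # so _name and _module never recurse into here.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

stats = LazyModule("scipy.stats")  # cheap: nothing imported yet
# p = stats.norm.sf(1.96)          # the real import happens on first use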
> Even if we are not as successful as before, it is still faster to import parts than everything (i.e., statsmodels.api). This is especially important for library use, and not so much for interactive work where everything is loaded once and then used for a long time. (The main problems are scipy.stats and pandas, but I'm not giving up yet.)

I guess I'm not really seeing the cases where people are using small parts of the core. But whether or not that is true, many large libraries have had to deal with this problem, and none of them that I know of end up with a mylib.api import convention.
> In comparison to other packages: I think the programs "written by statisticians for statisticians" hide too much behind the scenes, and I always had problems discovering what is hiding in the formulas.

The patsy interface is well documented, so I don't think that's a concern. (But maybe a valid concern against defining log, exp, and lag operations.)