Time for an API Cleanup?


Damien Moore

Oct 24, 2017, 3:50:28 PM
to pystatsmodels
Why do I have to do this:

import numpy as np
import statsmodels.api as sm #New users say what is "api"? Not pythonic, one more reason not to use the library.
import statsmodels.formula.api as smf #A separate module if I want to use formulas!?
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()  # lower case ols!

instead of 

import numpy as np
import statsmodels as sm
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = sm.OLS('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()  # and I wonder if there shouldn't be some workaround for needing to import and reference np? Who wants "np.log" in their summary output?

That's the sort of thing that 95 percent of new users would want to do.

The call signature for OLS (and any other estimation class that can accept formulas) could be:

OLS(endog_or_formula, exog=None, data=None, ...)

if endog_or_formula is a str, data must not be None
if endog_or_formula is a formula string, exog must be None
if endog_or_formula is a variable-name string, exog must be a str or a list of str variable names
if endog_or_formula is not a str (i.e., an array type), exog must be an array type
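
A rough sketch of how that dispatch could look (a hypothetical illustration only; the `ols` below and the "~" test are mine, not the statsmodels API, with patsy's dmatrices doing the formula parsing):

import numpy as np
from patsy import dmatrices

def ols(endog_or_formula, exog=None, data=None):
    # Hypothetical dispatch, not the actual statsmodels constructor.
    if isinstance(endog_or_formula, str):
        if "~" in endog_or_formula:
            # Formula string: data is required and exog must stay None.
            if data is None or exog is not None:
                raise ValueError("a formula needs data and no exog")
            endog, exog = dmatrices(endog_or_formula, data)
        else:
            # Variable-name string: exog is a name or a list of names in data.
            names = [exog] if isinstance(exog, str) else list(exog)
            endog, exog = data[endog_or_formula], data[names]
    else:
        # Array-like endog: exog must be array-like too.
        endog, exog = np.asarray(endog_or_formula), np.asarray(exog)
    return endog, exog  # the real version would build and return the model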



Paul Hobson

Oct 24, 2017, 3:58:05 PM
to pystat...@googlegroups.com
I don't think it's quite so simple.

For instance, it'd make sense to keep the API similar to seaborn and matplotlib. That would mean supporting the case where *endog* and *exog* are strings naming columns in a dataframe passed to the *data* parameter.

Differentiating between a formula and column name is easy 95% of the time, but that other 5% is a pain.

I have other ideas about this (e.g., passing Formula objects instead of strings), but I'm not in a place to implement any of that.
-Paul

Paul Hobson

Oct 24, 2017, 3:59:05 PM
to pystat...@googlegroups.com
Oh wait, I misread a few things. Ignore that previous message.
-p

josef...@gmail.com

Oct 24, 2017, 5:20:27 PM
to pystatsmodels
part 1:
Why api and not just import everything in statsmodels.__init__

Why do I have to import pandas, matplotlib and two thirds of scipy when I just want to run a t-test?

We are less successful in this now because we import pandas in most modules and pandas imports matplotlib whether we need it or not. Importing scipy.stats imports a large fraction of scipy.


part 2
Zen rule 2: Explicit is better than implicit.

I don't like the complications and convoluted code that we get from second-guessing users (e.g. pandas' early indexing: is it row or column, or a bit of both, or neither?), and the ambiguity of having arguments with different meanings. (Although I'm starting to get a little bit less pessimistic about multiple dispatch in general. (*))

You can always do `OLS.from_formula(formula, data, ...)`
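
For reference, that spelling works today with the Guerry example from the first post (from_formula is the documented classmethod that the statsmodels.formula.api aliases wrap):

import numpy as np
import statsmodels.api as sm

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = sm.OLS.from_formula('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()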


IMO, saving having to type a few extra characters or having to use auto-completion is not worth the extra ambiguity and complexity.


I wonder if there shouldn't be some workaround for needing to import and reference np?

from numpy import log
results = smf.ols('Lottery ~ Literacy + log(Pop1831)', data=dat).fit()

patsy evaluates formulas in the calling environment, so what the function is called doesn't make a difference, e.g.

from numpy import log as Ln
results = smf.ols('Lottery ~ Literacy + Ln(Pop1831)', data=dat).fit()

Note: We had discussions on this early on, and I think this design is quite good.
Some things are still changing, like the extra arrays that we need in some models (e.g. offset, weights), which can in some cases be either a string (when an associated data argument is available) or an array_like (numpy or pandas).

Most of the time I use direct imports, because I like to restart the interpreter or notebooks, and it's faster to import the things that are used than to import everything through statsmodels.api. I usually rely on auto-completion for spelling and to save typing.

from statsmodels.regression.linear_model import OLS
OLS.f<tab>

Josef
PS, somewhat related and another of my pet peeves
Once upon a time the standard was
from numpy import *
from pylab import *

I never liked this.
I never liked that it was impossible (or very difficult) to use namespaces in Matlab.
When I read Julia code every once in a while, I have no idea where a considerable fraction of the names is coming from.

Damien Moore

Oct 27, 2017, 3:05:28 PM
to pystatsmodels
When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get a new user there should be a priority.


part 1:
Why api and not just import everything in statsmodels.__init__

Why do I have to import pandas, matplotlib and two thirds of scipy when I just want to run a t-test?


I'm not convinced. This seems like sacrificing clarity for an implementation detail. There are other approaches, like lazy imports, that would conserve memory without giving up clarity (although this should probably be implemented Python-wide). As soon as you try to touch anything useful you're going to be importing the bulk of those libs anyway, as you noted. It took me a long time to appreciate what statsmodels.api was actually doing compared to just importing statsmodels, and I was confused for a long time about where the stuff defined in the api was supposed to live in the documentation. Most people, especially new users, aren't going to look at your source code to figure it out.


part 2
Zen rule 2: Explicit is better than implicit.

I don't like the complications and convoluted code that we get from second-guessing users (e.g. pandas' early indexing: is it row or column, or a bit of both, or neither?), and the ambiguity of having arguments with different meanings. (Although I'm starting to get a little bit less pessimistic about multiple dispatch in general. (*))

You can always do `OLS.from_formula(formula, data, ...)`


That's what the front page of the docs should show! http://www.statsmodels.org/stable/index.html

 

IMO, saving having to type a few extra characters or having to use auto-completion is not worth the extra ambiguity and complexity.



Given the statistics context, parsing the first argument for a formula in a model constructor seems like something that most users should expect (especially if the main page shows you that you can do that). It's no less natural than the frequent idiom of having functions/methods whose first argument can be a string with a filename or a file-like buffer. The only messy edge case is when people decide they want "~" in their variable names. A compromise might be giving exog and endog default arguments of None and adding an optional formula (and data) argument, as sketched below.
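
A minimal sketch of that compromise (hypothetical; SketchModel and its argument checks are illustrative only, not statsmodels code):

from patsy import dmatrices

class SketchModel:
    # Hypothetical constructor: endog/exog default to None, formula/data added.
    def __init__(self, endog=None, exog=None, formula=None, data=None):
        if formula is not None:
            if endog is not None or exog is not None:
                raise ValueError("pass a formula or endog/exog, not both")
            if data is None:
                raise ValueError("a formula requires a data argument")
            endog, exog = dmatrices(formula, data)
        elif endog is None or exog is None:
            raise ValueError("need endog and exog when no formula is given")
        self.endog, self.exog = endog, exog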


I wonder if there shouldn't be some workaround for needing to import and reference np?

from numpy import log
results = smf.ols('Lottery ~ Literacy + log(Pop1831)', data=dat).fit()

patsy evaluates formulas in the calling environment, so what the function is called doesn't make a difference, e.g.

from numpy import log as Ln
results = smf.ols('Lottery ~ Literacy + Ln(Pop1831)', data=dat).fit()

I did understand that. I was really pointing out that the interface feels clunkier when you need to explicitly import log or exp in an application domain where those functions are essential workhorses. Those seemingly small steps add up to making the new-user experience more convoluted than it needs to be.
 
PS, somewhat related and another of my pet peeves
Once upon a time the standard was
from numpy import *
from pylab import *

I never liked this.
I never liked that it was impossible (or very difficult) to use namespaces in Matlab.
When I read Julia code every once in a while, I have no idea where a considerable fraction of the names is coming from.


You won't get any argument from me on this. * imports are terrible in most cases. I stopped using C++ because of the lack of sane namespace usage. (Java might go too far the other way with the excessive nesting.)
 

josef...@gmail.com

Oct 27, 2017, 3:57:50 PM
to pystatsmodels
just replying to two general points

On Fri, Oct 27, 2017 at 3:05 PM, Damien Moore <damien...@gmail.com> wrote:
When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get a new user there should be a priority.


I'm not coming from R, and in most cases I don't use formulas. And for applications where speed matters it is better to avoid formulas, both in statsmodels and in R.
Formulas are very convenient for handling interaction effects and statsmodels can provide some features only when it has the formula information.
But there are a large number of use cases where we just want to run simple estimation problems on hundreds or thousands of cases. Running formulas each time is just a waste.

The general point is that a newbie is a newbie for a month or two, but uses the package (hopefully) for years. So I'm not sacrificing code/API clarity for convenience when there is a large enough cost to doing so.
(Same idea as code being read more often than written.)

 

part 1:
Why api and not just import everything in statsmodels.__init__

Why do I have to import pandas, matplotlib and two thirds of scipy when I just want to run a t-test?


I'm not convinced. This seems like sacrificing clarity for an implementation detail. There are other approaches, like lazy imports, that would conserve memory without giving up clarity (although this should probably be implemented Python-wide). As soon as you try to touch anything useful you're going to be importing the bulk of those libs anyway, as you noted. It took me a long time to appreciate what statsmodels.api was actually doing compared to just importing statsmodels, and I was confused for a long time about where the stuff defined in the api was supposed to live in the documentation. Most people, especially new users, aren't going to look at your source code to figure it out.


Six to eight years ago there were several packages that tried lazy import. As far as I know, they all gave up. It didn't work well and was much too complicated.

Even if we are not as successful as before, it is still faster to import parts than everything, i.e., the statsmodels.api. This is especially important for library use and not so much for interactive work where everything is loaded once and then used for a long time.
(The main problems are scipy.stats and pandas, but I'm not giving up yet.)


Overall, there are many trade-offs in writing the package. We added statsmodels.formula.api and the aliases to model.from_formula so it is still convenient to use the formula interface. We also try to have an API that is "reasonably easy" to discover. But that's a lot of work, and user convenience might not always be the highest priority in deciding the trade-offs.

In comparison to other packages: I think the programs "written by statisticians for statisticians" hide too much behind the scenes, and I always had problems discovering what is hiding in the formulas. What I like about a commercial package that is popular among applied economists and some other fields is that everything follows the same pattern and is well integrated. This is one of my objectives for statsmodels, and currently there is a single pattern for all models that users have to learn only once, or maybe a few patterns given the different underlying structure in time series analysis.

from statsmodels.formula.api import ols

result = ols(my_formula, my_data).fit(cov_type='cluster',
                                      cov_kwds={'groups': my_data['mygroups']},
                                      use_t=True)
result.wald_test('A + B = 1, C = 0')

or something like this

Josef

Damien Moore

Nov 8, 2017, 3:45:33 PM
to pystatsmodels
Picking this back up...


On Friday, October 27, 2017 at 3:57:50 PM UTC-4, josefpktd wrote:
just replying to two general points

On Fri, Oct 27, 2017 at 3:05 PM, Damien Moore <damien...@gmail.com> wrote:
When reading below, bear in mind that what I finally started to appreciate is that econometrics in python is more frustration free when using the patsy interface than when having to define dozens of calculated variables. This is what the new user experience should be and reducing the number of steps to get the new user there should be a priority.


I'm not coming from R, and in most cases I don't use formulas.

I don't come from R either, but in lots of cases, particularly those that a novice user is interested in (or really for anyone doing exploratory data analysis), the formula interface is better. It creates a graceful way to create a whole host of calculated variables without blowing up the original dataset.
 
And for applications where speed matters it is better to avoid formulas, both in statsmodels and in R. 
Formulas are very convenient for handling interaction effects and statsmodels can provide some features only when it has the formula information.
But there are a large number of use cases where we just want to run simple estimation problems on hundreds or thousands of cases. Running formulas each time is just a waste.


Again, I'm not talking about making it impossible to do it the way you like (which I like too in some cases), just changing the default to make the more beginner-friendly thing more accessible from the outset. As a fellow economist, I thought you would appreciate the potency of good defaults. :) The use cases you are focused on come up more frequently in an academic context than in applied work.
 
The general point is that a newbie is a newbie for a month or two, but uses the package (hopefully) for years. So I'm not sacrificing code/API clarity for convenience when there is a large enough cost to doing so.
(Same idea as code being read more often than written.)

I'm not seeing the cost in clarity. As I said above, there's a huge clarity cost that comes with the namespace rearrangement when importing statsmodels.api, which every new user is told to do, and with that imported namespace no longer lining up with the reference docs.
 
Six to eight years ago there were several packages that tried lazy import. As far as I know, they all gave up. It didn't work well and was much too complicated.


For something that uses a lot of C or Cython or needs really fine-grained memory management I can see that could be a problem, but it's a pretty simple monkey patch to override the importer to create a proxy object that defers the actual import until objects in it are referenced. It's getting more fine-grained than this that's challenging, but for statsmodels I doubt that's necessary. The main thing would be to make sure nothing in the API actually tries to touch the internals of scipy/pandas/other big libraries on import.
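
A minimal sketch of the kind of proxy being described (my illustration, not statsmodels code; newer Pythons also ship importlib.util.LazyLoader for this):

import importlib
import sys
import types

class _LazyModule(types.ModuleType):
    # Placeholder module that performs the real import on first attribute access.
    def __getattr__(self, attr):
        module = self.__dict__.get("_module")
        if module is None:
            # Drop the placeholder so importlib performs the real import
            # (which also puts the real module back into sys.modules).
            sys.modules.pop(self.__name__, None)
            module = importlib.import_module(self.__name__)
            self.__dict__["_module"] = module
        return getattr(module, attr)

def lazy_import(name):
    # Hypothetical helper: register a placeholder under the module's name.
    if name not in sys.modules:
        sys.modules[name] = _LazyModule(name)
    return sys.modules[name]

pd = lazy_import("pandas")     # cheap: nothing heavy imported yet
# pd.DataFrame({"x": [1, 2]})  # first attribute access triggers the real import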
 
Even if we are not as successful as before, it is still faster to import parts than everything, i.e., the statsmodels.api. This is especially important for library use and not so much for interactive work where everything is loaded once and then used for a long time.
(The main problems are scipy.stats and pandas, but I'm not giving up yet.)


I guess I'm not really seeing the cases where people are using small parts of the core. But whether or not that is true, many large libraries have had to deal with this problem, and none of them that I know of ended up with a mylib.api import convention.
 

In comparison to other packages: I think the programs "written by statisticians for statisticians" hide too much behind the scenes, and I always had problems discovering what is hiding in the formulas.

The patsy interface is well documented, so I don't think that's a concern. (But maybe it's a valid concern about defining log, exp, and lag operations.)
 
What I like about a commercial package that is popular among applied economists and some other fields is that everything follows the same pattern and is well integrated. This is one of my objectives for statsmodels, and currently there is a single pattern for all models that users have to learn only once, or maybe a few patterns given the different underlying structure in time series analysis.


Absolutely agree and I think you have done a very good job with this overall.

Nathaniel Smith

Nov 8, 2017, 4:03:24 PM
to pystatsmodels
On Fri, Oct 27, 2017 at 2:57 PM, <josef...@gmail.com> wrote:
> Six to eight years ago there were several packages that tried lazy import.
> As far as I know, they all gave up. It didn't work well and was much too
> complicated.

https://github.com/njsmith/metamodule/

# Paste this at the top of your module:
import metamodule
metamodule.install(__name__)
del metamodule

# List the submodules you want to be lazily imported on first access
__auto_import__ = {
    "submodule1",
    "submodule2",
    ...
}

# You're done.

I tend to agree with the argument that this is a non-issue, but if you
want lazy importing it's very easy to do.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith

Nov 8, 2017, 4:06:11 PM
to pystatsmodels
On Tue, Oct 24, 2017 at 2:50 PM, Damien Moore <damien...@gmail.com> wrote:
> The call sig for OLS (and any other estimation class that can accept
> formulas) could be:
>
> OLS(endog_or_formula, exog=None, data=None, ...)
>
> if endog_or_formula is str, data must not be None
> if endog_or_formula is formula, exog must be None
> if endog_or_formula is str variable, exog must be str or str list of
> variables
> if endog_or_formula is not str (i.e., array type), exog must be array type

Patsy's high-level APIs like dmatrix() and dmatrices() actually accept
a (endog_array, exog_array) tuple in place of the formula and do the
right thing, so if statsmodels just passed through the formula and
data arguments to dmatrices then it'd automatically support the
various options here (with slightly different syntax).
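
A small illustration of that pass-through (my example, not statsmodels code; per the patsy docs, the high-level functions accept a (lhs, rhs) pair of array_likes in place of a formula):

import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0],
                   "x": [0.0, 1.0, 2.0, 3.0]})

# Formula string: patsy builds both design matrices from `data`.
y1, X1 = dmatrices("y ~ x", data=df)

# (lhs, rhs) tuple of array_likes: passed through with the same return shape.
y2, X2 = dmatrices((df[["y"]], df[["x"]]))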

josef...@gmail.com

Nov 8, 2017, 4:24:39 PM
to pystatsmodels
maybe that's the reason my import times went up by a huge amount: pandas loads pytest, which loads py, which loads apipkg, which messes with the import system

(I don't like it when packages mess around with my python.)

Josef

Jeff Reback

Nov 8, 2017, 4:26:35 PM
to pystat...@googlegroups.com

Note that in 0.21.0 pandas will now lazily import pytest and matplotlib, so import times should be down a bit

Nathaniel Smith

Nov 8, 2017, 4:38:40 PM
to pystatsmodels
On Wed, Nov 8, 2017 at 3:24 PM, <josef...@gmail.com> wrote:
>
>
> On Wed, Nov 8, 2017 at 4:03 PM, Nathaniel Smith <n...@vorpus.org> wrote:
>>
>> On Fri, Oct 27, 2017 at 2:57 PM, <josef...@gmail.com> wrote:
>> > Six to eight years ago there were several packages that tried lazy
>> > import.
>> > As far as I know, they all gave up. It didn't work well and was much too
>> > complicated.
>>
>> https://github.com/njsmith/metamodule/
>>
>> # Paste this at the top of your module:
>> import metamodule
>> metamodule.install(__name__)
>> del metamodule
>>
>> # List the submodules you want to be lazily imported on first access
>> __auto_import__ = {
>> "submodule1",
>> "submodule2",
>> ...
>> }
>>
>> # You're done.
>>
>> I tend to agree with the argument that this is a non-issue, but if you
>> want lazy importing it's very easy to do.
>
>
> maybe that's the reason my import times went up by a huge amount
> pandas loads pytest which loads py which loads apipkg which messes with the
> import system

apipkg doesn't mess with the import system, it just messes with the
exports of packages that use it to set up their api.

Loading pytest at import time does seem a bit egregious though, glad
to hear they fixed that.

josef...@gmail.com

Nov 8, 2017, 4:56:53 PM
to pystatsmodels
maybe it's a coincidence, but I was trying to figure out what slowed down my imports, and it seems to coincide with our and other packages' switch to pytest.

I also saw that spyder loads "scientific", which I thought I had cleaned out but which I guess came back with an update or a switch in Python.

(In my usage I either run things repeatedly on the command line or in new spyder interpreters/consoles, so I don't have to worry about reloading or restarting a notebook every few minutes.
For example, I'm running some covariance hypothesis tests, and I guess the imports take longer than the computations.)

Josef

josef...@gmail.com

Nov 8, 2017, 5:01:29 PM
to pystatsmodels
On Wed, Nov 8, 2017 at 4:26 PM, Jeff Reback <jeffr...@gmail.com> wrote:

Note that in 0.21.0 pandas will now lazily import pytest and matplotlib, so import times should be down a bit

Thanks Jeff.

Josef

josef...@gmail.com

Nov 10, 2017, 2:45:17 PM
to pystatsmodels
On Wed, Nov 8, 2017 at 3:45 PM, Damien Moore <damien...@gmail.com> wrote:
Picking this back up...

On Friday, October 27, 2017 at 3:57:50 PM UTC-4, josefpktd wrote:
just replying to two general points

On Fri, Oct 27, 2017 at 3:05 PM, Damien Moore <damien...@gmail.com> wrote:
When reading below, bear in mind that what I finally started to appreciate is that econometrics in Python is more frustration-free when using the patsy interface than when having to define dozens of calculated variables. This is what the new-user experience should be, and reducing the number of steps to get a new user there should be a priority.


I'm not coming from R, and in most cases I don't use formulas.

I don't come from R either, but in lots of cases, particularly those that a novice user is interested in (or really for anyone doing exploratory data analysis), the formula interface is better. It creates a graceful way to create a whole host of calculated variables without blowing up the original dataset.
 
And for applications where speed matters it is better to avoid formulas, both in statsmodels and in R. 
Formulas are very convenient for handling interaction effects and statsmodels can provide some features only when it has the formula information.
But there are a large number of use cases where we just want to run simple estimation problems on hundreds or thousands of cases. Running formulas each time is just a waste.


Again, I'm not talking about making it impossible to do it the way you like (which I like too in some cases), just changing the default to make the more beginner-friendly thing more accessible from the outset. As a fellow economist, I thought you would appreciate the potency of good defaults. :) The use cases you are focused on come up more frequently in an academic context than in applied work.
 
The general point is that a newbie is a newbie for a month or two, but uses the package (hopefully) for years. So I'm not sacrificing code/API clarity for convenience when there is a large enough cost to doing so.
(Same idea as code being read more often than written.)

I'm not seeing the cost in clarity. As I said above, there's a huge clarity cost that comes with the namespace rearrangement when importing statsmodels.api, which every new user is told to do, and with that imported namespace no longer lining up with the reference docs.
 
Six to eight years ago there were several packages that tried lazy import. As far as I know, they all gave up. It didn't work well and was much too complicated.


For something that uses a lot of C or Cython or needs really fine-grained memory management I can see that could be a problem, but it's a pretty simple monkey patch to override the importer to create a proxy object that defers the actual import until objects in it are referenced. It's getting more fine-grained than this that's challenging, but for statsmodels I doubt that's necessary. The main thing would be to make sure nothing in the API actually tries to touch the internals of scipy/pandas/other big libraries on import.

Sounds too complex and dangerous to me, since we are the ones who have to fight with the problems this causes. And I don't think it will work without problems.

 
 
Even if we are not as successful as before, it is still faster to import parts than everything, i.e., the statsmodels.api. This is especially important for library use and not so much for interactive work where everything is loaded once and then used for a long time.
(The main problems are scipy.stats and pandas, but I'm not giving up yet.)


I guess I'm not really seeing the cases where people are using small parts of the core. But whether or not that is true, many large libraries have had to deal with this problem, and none of them that I know of ended up with a mylib.api import convention.

The api.py files are supposed to be the compromise between a reasonably (?) flat namespace for interactive use and library use with full paths. In some fields researchers prefer command-line scripts, for which importing selectively can be important. My own development use case is also to restart or open a new console very often.

Another effect that I haven't checked in a while is parallel computation on Windows, where the interpreter needs to be started in every process. E.g., when I tested this several years ago, anything below 10 seconds of run time was faster in one process. Adding a few seconds of import time to each process can make quite a bit of difference, given that almost all our current methods are pretty fast (not that we are always fast compared to other packages, but we have very few long-running algorithms).
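
A rough way to see that per-process cost (a sketch added for illustration; the numbers are machine-dependent):

import subprocess
import sys
import time

# Each worker process pays interpreter start-up plus the import, so time both.
start = time.time()
subprocess.run([sys.executable, "-c", "import statsmodels.api"], check=True)
print("cold start + import: %.2f seconds" % (time.time() - start))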

 
 

In comparison to other packages: I think the programs "written by statisticians for statisticians" hide too much behind the scenes, and I always had problems discovering what is hiding in the formulas.

The patsy interface is well documented, so I don't think that's a concern. (But maybe it's a valid concern about defining log, exp, and lag operations.)

My main problems in R are mixed-effects specifications and other multi-part formulas. They might be a nice shorthand for experts, but not for a newbie like me.

An example that just showed up:
To me it looks like the trade-off between user convenience and flexibility on the one hand and understandable design and easy maintenance on the other.

multiprocessing, pickling, patsy formulas, stateful transforms and arbitrary user transforms.
There are way too many places where something can go wrong, and most of it is out of the control of statsmodels.
Our unit tests, and the examples I ran, are for "optimistic" cases but cover a tiny fraction of what is possible given the flexibility that this provides.

(Some users are like lumberjacks that can use a chainsaw. But then give all users a chainsaw and watch them cut butter. :)

Josef
writing a few lines a day