volunteers? global warning level, mode interactive ?


josef...@gmail.com

Aug 13, 2012, 9:29:58 PM
to pystatsmodels
It's becoming more and more necessary to change the behavior of
statsmodels for different use cases.

Besides the recent rank deficient design matrix case, there is also a
discussion about checking the data
https://github.com/statsmodels/statsmodels/issues/426 and
https://github.com/statsmodels/statsmodels/pull/422 .

Some checks can be triggered by options, but many checks are currently
skipped to avoid calculations that are unnecessary in most cases (read:
if statsmodels is "properly" used).

We often discussed it, but it has never been implemented.

I still think a global option variable, like matplotlib.isinteractive
or the print options and seterr in numpy, is the best way to go, so we can
add checks like

from statsmodels import user_options
if user_options == 'paranoid':
    check this
    and that
    and something else
    raise ???Error("something wrong with your data")


and to turn it off

with user_option = "avoid_unnecessary_checks":
    results.get_bootstrap(nrep=5000)

implementation details ? (what kind of context manager do we need?
how does it interact with explicit options?)
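
One possible shape for such a context manager, as a minimal sketch only. The names `_options`, `get_option`, and `option_context` are hypothetical and not part of statsmodels. The `try`/`finally` restores the previous value even when the body raises.

```python
from contextlib import contextmanager

# Hypothetical module-level setting; not an existing statsmodels API.
_options = {"checks": "default"}


def get_option(name):
    """Return the current value of a global option."""
    return _options[name]


@contextmanager
def option_context(name, value):
    """Temporarily set a global option and restore it on exit."""
    old = _options[name]
    _options[name] = value
    try:
        yield
    finally:
        # Runs even if the body raises, so the old value always comes back.
        _options[name] = old


with option_context("checks", "avoid_unnecessary_checks"):
    inside = get_option("checks")
after = get_option("checks")
```

One way to handle the interaction with explicit options would be to let an explicit keyword win and treat `None` as "use the global value", but that is a design choice, not something settled here.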

user_option might be just a shortcut; one of the previous discussions
was more fine-grained, in analogy to
numpy.seterr(all=None, divide=None, over=None, under=None, invalid=None)

suggestions, comments and especially pull request welcome

Josef

Fernando Perez

Aug 13, 2012, 11:17:52 PM
to pystat...@googlegroups.com
On Mon, Aug 13, 2012 at 6:29 PM, <josef...@gmail.com> wrote:
> suggestions, comments and especially pull request welcome

Just to say that this needs to be thought through very, very carefully for
anything with algorithmic implications. Libraries that have state are
borderline impossible to reason about, because all of a sudden,
changes in one location can affect the results of some other code in a
completely unexpected way:

a.py:
    statsmodels.weird_option = True

b.py:
    if weird_option is True:
        return bogus_result

c.py:
    from b import great_function
    ...
    ...
    import a
    print great_function(x,y,z) # WTF???


If you consider the examples you pointed out, numpy and matplotlib,
they only use these kinds of global flags very sparingly and mostly
only for things that impact either cosmetics/readability or the
interactive workflow, but not the actual results.

Numpy has the extraordinarily dangerous set_numeric_ops, that can
potentially alter the semantics of any numpy-using code in a radical
way. It does come with quite a few warnings...

It may be that you really need this for statsmodels, but I hope you
guys can really think through the implications very carefully. I have
yet to see a stateful library where I'm convinced that was a good idea
(other than cosmetic/interactive flow settings).

Just a thought from the peanut gallery...

Cheers,

f

Skipper Seabold

Aug 14, 2012, 10:45:01 AM
to pystat...@googlegroups.com

Sure. One example of this precedent is R. You could globally set na.action to na.fail, for example. What we'd have in this case is not a bogus result versus a correct result depending on state: either fit raises an error or it works. The reason for being able to turn this on and off is that checking for missing data has some overhead, which we might want to avoid if, for instance, we're bootstrapping.

Since this is the case, I think it may make sense to allow some kind of global state, but I agree we should think about this to see if we really want it. I think what most users would prefer is the default of missing = "drop", while most developers and maybe "power" users are probably going to want to keep the current do-nothing and fail default. Global state seems the only option to placate both sets.

If we decide that we want a global state (and this would only affect things that inherit from Model, we're not proposing that functions like mean, std, etc. would be affected), then I think every Results instance should also have the state the model was fit in attached.
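
A minimal sketch of attaching the fit-time state to the results. It assumes a hypothetical module-level default `GLOBAL_MISSING` and toy `Model` and `Results` classes; none of this is statsmodels code.

```python
import math

# Hypothetical global default: "raise" fails on missing data, "drop" removes it.
GLOBAL_MISSING = "raise"


class Results:
    def __init__(self, params, missing):
        self.params = params
        # Record the state the model was fit under, for later inspection.
        self.missing = missing


class Model:
    def __init__(self, endog):
        self.endog = list(endog)

    def fit(self):
        missing = GLOBAL_MISSING  # read the global once, at fit time
        data = self.endog
        if missing == "drop":
            data = [y for y in data if not math.isnan(y)]
        elif missing == "raise" and any(math.isnan(y) for y in data):
            raise ValueError("endog contains missing values")
        # Toy "estimate": the sample mean.
        return Results(params=sum(data) / len(data), missing=missing)


GLOBAL_MISSING = "drop"
res = Model([1.0, float("nan"), 3.0]).fit()
GLOBAL_MISSING = "raise"  # later changes do not alter what res reports
```

With the setting stored on `res`, a surprising estimate can at least be traced back to the state that produced it.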

Skipper

Ralf Gommers

Aug 14, 2012, 2:59:34 PM
to pystat...@googlegroups.com

Letting one of those two groups (probably power users) type 14 extra characters is also an option. For scipy.linalg for example we'll probably merge https://github.com/scipy/scipy/pull/48, adding a "check_finite" kw to a large number of functions. Seems a less bad solution than introducing global state to me.
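
The per-call keyword pattern can be sketched as follows. The `mean` function here is a toy written for illustration, not a scipy or statsmodels function; only the keyword name `check_finite` comes from the pull request mentioned above.

```python
import math


def mean(values, check_finite=True):
    """Arithmetic mean with an optional input check.

    The check is on by default; callers who know their data is clean
    can opt out per call instead of flipping a global switch.
    """
    values = list(values)
    if check_finite and not all(math.isfinite(v) for v in values):
        raise ValueError("input contains inf or NaN")
    return sum(values) / len(values)


checked = mean([1.0, 2.0, 3.0])
# Skipping the check saves one pass over the data, e.g. inside a bootstrap loop.
unchecked = mean([1.0, 2.0, 3.0], check_finite=False)
```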

Ralf
 

josef...@gmail.com

Aug 14, 2012, 3:16:09 PM
to pystat...@googlegroups.com
We have some of these keywords, but I don't want to write a long list
of keyword options in every call, and add them to the signature of
every method.
exaggerated: imagine every function needs an extra set of keywords
(divide=None, over=None, under=None, invalid=None)

I worry that we will need a lot of keywords additional to what we
have, (urgent: has_const or const_idx, with related propagation
problems. Skipper's nan branch rearranges the class hierarchy to avoid
**kwds).

Skipper described the situation well. And as long as we stick with
output options 1), and check/warning/exception usage, we don't get
"strange" numbers, we might get no numbers, or numbers that we don't
care about.
Which should take care of most of Fernando's worries.

The other part is that it requires some discipline: no monkey
patching, and if you change the globals, change them back.

(Since numpy increased the default warning level, scipy is gaining
code that turns warnings off and on.)

1) we discussed making __repr__ noisy in "interactive" user mode.

Josef

Ralf Gommers

Aug 14, 2012, 3:48:23 PM
to pystat...@googlegroups.com

Then you could create one keyword which is a dict that can contain those keywords. As long as the defaults are sane, it's not that big an issue.

George actually just implemented something similar, using a class SetDefaults which can be instantiated and passed to the kernel estimators to control behavior.
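
A sketch of the single-options-object idea, under stated assumptions: `CheckOptions` and `Estimator` are hypothetical names and this is not George's SetDefaults class. The field names are only examples drawn from the checks discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckOptions:
    """Hypothetical bundle of check settings with sane defaults."""
    missing: str = "raise"
    check_rank: bool = True


class Estimator:
    def __init__(self, data, options=None):
        self.data = list(data)
        # One keyword instead of many; None means "use the defaults".
        self.options = options if options is not None else CheckOptions()


default_est = Estimator([1, 2, 3])
fast_est = Estimator([1, 2, 3], options=CheckOptions(check_rank=False))
```

Making the options object immutable (`frozen=True`) means one estimator cannot change the settings seen by another that shares the same object.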

> I worry that we will need a lot of keywords additional to what we
> have, (urgent: has_const or const_idx, with related propagation
> problems. Skipper's nan branch rearranges the class hierarchy to avoid
> **kwds).
>
> Skipper described the situation well. And as long as we stick with
> output options 1), and check/warning/exception usage, we don't get
> "strange" numbers, we might get no numbers, or numbers that we don't
> care about.
> Which should take care of most of Fernando's worries.

I'm not sure I have a good overview of all globals being proposed here. I've seen:
- missing data
- rank deficient matrices
- has_const
- changing __repr__

Is there more?
 

> The other part is that it requires some discipline: no monkey
> patching, and if you change the globals, change them back.

That's exactly the problem. If you're only thinking about one end user it may still be okay, but once you're in a situation with many developers of varying skill levels, you can forget about discipline.

Ralf

Fernando Perez

Aug 15, 2012, 3:40:09 PM
to pystat...@googlegroups.com
On Tue, Aug 14, 2012 at 7:45 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> If we decide that we want a global state (and this would only affect things
> that inherit from Model, we're not proposing that functions like mean, std,
> etc. would be affected), then I think every Results instance should also
> have the state the model was fit in attached.

If at all possible, I'd ask you to consider at least encapsulating
state at the object instance level. What gives me the chills is
module-level state, because it's impossible to know whether a call you
make in one place may have modified it (and perhaps failed to revert
its modification).

And object state doesn't necessarily have to imply adding a million
keywords to every method call, some of these things may be settable
only in the constructor, or directly as attributes (or properties if
additional validation is desired).

Furthermore, one way to avoid module state could be to provide all
default calls as standalone functions, along with classes that wrap
the same functionality but with non-default state. That way all
top-level functions can be called directly and don't rely on any
module state, but users who want to set different state flags can
create just one object with the new state and use its methods from
then on. This allows for convenient use of non-default state (set
once and use it, without having to pass kwargs all the time), while
leaving the module free of true global state.
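
The function-plus-class layout described above could look roughly like this. All names (`_mean`, `mean`, `Stats`) are invented for the sketch, and the `missing` setting is just one example of state a user might want to change.

```python
import math


def _mean(values, missing):
    """Shared implementation; all behavior is passed in explicitly."""
    values = list(values)
    if missing == "drop":
        values = [v for v in values if not math.isnan(v)]
    elif missing == "raise" and any(math.isnan(v) for v in values):
        raise ValueError("data contains NaN")
    return sum(values) / len(values)


def mean(values):
    """Top-level function: always the default behavior, no module state."""
    return _mean(values, missing="raise")


class Stats:
    """Carries non-default settings on the instance, set once."""

    def __init__(self, missing="raise"):
        self.missing = missing

    def mean(self, values):
        return _mean(values, self.missing)


dropper = Stats(missing="drop")
dropped_mean = dropper.mean([1.0, float("nan"), 3.0])
plain_mean = mean([1.0, 3.0])
```

Nothing a caller does to `dropper` can change what the top-level `mean` returns, which is the isolation property being argued for.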

Cheers,

f

Fernando Perez

Aug 15, 2012, 3:55:30 PM
to pystat...@googlegroups.com
On Tue, Aug 14, 2012 at 12:16 PM, <josef...@gmail.com> wrote:
> The other part is that it requires some discipline: no monkey
> patching, and if you change the globals, change them back.

I'm afraid experience shows this is a bad idea: ensuring that changes
like these remain truly isolated is impossible, and all it takes is
one user making a mistake for the library state to be 'damaged' in a
way that no other code can automatically detect. That is the most
dangerous form of brittleness imaginable, and I think it should be
avoided at all costs. Either you tell users "assume that the library
has global state that can potentially change at any time, and write
your code defensively to protect against that", or you make it
impossible for that to happen.

But I think your suggestion above is akin to saying "drivers should
not make mistakes so that being on the road isn't dangerous". We
all know that's not realistic, which is why we teach "drive
defensively assuming at any time someone *else* may do something
stupid and unexpected, and even if it's their fault, *you* should be
able to react so as to protect yourself".

Cheers,

f

josef...@gmail.com

Aug 15, 2012, 4:49:55 PM
to pystat...@googlegroups.com
Or: "If everyone else is driving a sports car, why do we need to stick
with my old Toyota Corolla."
(even if in the city traffic, sports cars don't go any faster either.)

I don't really see why we are getting more resistance than other packages.

I assume interactive sessions are single user, and if a package that
uses statsmodels, like pandas, leaves the globals changed, then the user
can reset them and complain (file a bug report) to the package author (or
to us if we forget to do it).

class attributes:

>>> from statsmodels.base.model import LikelihoodModel
>>> from statsmodels.regression.linear_model import OLS
>>> LikelihoodModel.do_funny_things = True
>>> mod = OLS(endog, exog)
>>> if mod.do_funny_things: print 'ouch'
...
ouch

(which means, as you said, the only alternative is instance level)

I understand the problem with getting different numbers, but we are
discussing now just changing the warning/exception levels.
It can still bite a user, if (s)he relies on some automatic checking
that disappeared, which is not much different from the current
situation if a user ignores that (s)he might have "weird" data.
(Users would always be able to hit the brakes, and set the package
global to paranoid before running some estimation problems, i.e.
defensive users instead of defensive developers)

Josef


Fernando Perez

Aug 15, 2012, 5:17:12 PM
to pystat...@googlegroups.com
On Wed, Aug 15, 2012 at 1:49 PM, <josef...@gmail.com> wrote:
> I don't really see why we are getting more resistance than other packages.

I'm sorry if the feedback wasn't helpful, my only concern is if this
pattern goes into the algorithmic parts of the library; it wasn't
clear to me (perhaps I misunderstood) that all you had in mind was
control of interactive/cosmetic features. For that, certainly a few
global config flags are perfectly OK and make sense.

In any case, I'll head back up to the peanut gallery :)

Cheers,

f

josef...@gmail.com

Aug 15, 2012, 5:26:49 PM
to pystat...@googlegroups.com
On Wed, Aug 15, 2012 at 5:17 PM, Fernando Perez <fpere...@gmail.com> wrote:
> On Wed, Aug 15, 2012 at 1:49 PM, <josef...@gmail.com> wrote:
>> I don't really see why we are getting more resistance than other packages.
>
> I'm sorry if the feedback wasn't helpful, my only concern is if this
> pattern goes into the algorithmic parts of the library; it wasn't
> clear to me (perhaps I misunderstood) that all you had in mind was
> control of interactive/cosmetic features. For that, certainly a few
> global config flags are perfectly OK and make sense.

I was playing with the thought of doing more than the minimal
warning/exception and repr parts, but given all the earlier warnings
from you and Nathaniel, I think we'll stick with minimal usage.

>
> In any case, I'll head back up to the peanut gallery :)

Thanks for the "visit", and if we start to do dangerous things, then
additional comments and warnings are again appreciated.

Josef


josef...@gmail.com

Aug 15, 2012, 5:47:25 PM
to pystat...@googlegroups.com
As an example, a suggestion in the direction of Skipper's nan branch:

a keyword argument nan_action=None, where None means use the global
default, for example
check_drop: check for nan (or isfinite) and drop the corresponding rows

for time series (or GLSAR), we would need to disallow check_drop as
default, and use check_raise or no_check as defaults, since in most
cases dropping an observation in a time series wouldn't make sense.
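
A sketch of how the keyword and the global default might combine. Only the option names `check_drop`, `check_raise`, and `no_check` come from the message above; `DEFAULT_NAN_ACTION`, `resolve_nan_action`, `clean_rows`, and the `allow_drop` flag are hypothetical. Falling back to `check_raise` when a model disallows dropping is also an assumption, since `no_check` would be an equally valid fallback.

```python
import math

# Hypothetical global default, one of: "check_drop", "check_raise", "no_check".
DEFAULT_NAN_ACTION = "check_drop"


def resolve_nan_action(nan_action=None, allow_drop=True):
    """Pick the effective action: explicit keyword first, then the global."""
    action = DEFAULT_NAN_ACTION if nan_action is None else nan_action
    if action == "check_drop" and not allow_drop:
        if nan_action is not None:
            # The caller asked for it explicitly, so fail loudly.
            raise ValueError("check_drop is not allowed for this model")
        # A global check_drop falls back to raising for such models.
        action = "check_raise"
    return action


def clean_rows(rows, nan_action=None, allow_drop=True):
    """Apply the resolved action to a list of numeric rows."""
    action = resolve_nan_action(nan_action, allow_drop)
    if action == "no_check":
        return rows
    bad = [any(math.isnan(x) for x in row) for row in rows]
    if action == "check_raise" and any(bad):
        raise ValueError("data contains NaN")
    return [row for row, is_bad in zip(rows, bad) if not is_bad]


rows = [[1.0, 2.0], [float("nan"), 3.0]]
cross_section = clean_rows(rows)                  # global default drops the NaN row
ts_action = resolve_nan_action(allow_drop=False)  # e.g. a time series model
```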

Josef

