How do I preserve a model trained from statsmodel?

Jieyun Fu

unread,

Mar 9, 2012, 1:37:22 PM3/9/12

to pystatsmodels

Hi all,

What's the best way of preserving a model trained from statsmodel? For
example, say I obtain a model from the following piece of code (stolen
from the statsmodel document), what's the best way of saving variable
"model", so I don't have to refit it when I use it? model itself has
reference to the training data, so pickling will results in pickling
training data. I can just save the parameters in the RegressionResult
object ("results" in the code), but I can't build a new model based on
the parameters, so I can't use the same model APIs, such as predict().
OLS is easy to "predict" by ourselves, but for more fancy regression,
this will be non-trivial.

What will be the best way of preserving the model, so when I use it, I
can just load it from the disk, and use it as if it was just trained?
Thank you so much!

>>> import numpy as np
>>> import scikits.statsmodels.api as sm
>>>
>>> Y = [1,3,4,5,2,3,4]
>>> X = range(1,8) #[:,np.newaxis]
>>> X = sm.add_constant(X)
>>>
>>> model = sm.OLS(Y,X)
>>> results = model.fit()

josef...@gmail.com

unread,

Mar 9, 2012, 2:46:34 PM3/9/12

to pystat...@googlegroups.com

On Fri, Mar 9, 2012 at 1:37 PM, Jieyun Fu <jiey...@gmail.com> wrote:
> Hi all,
>
> What's the best way of preserving a model trained from statsmodel? For
> example, say I obtain a model from the following piece of code (stolen
> from the statsmodel document), what's the best way of saving variable
> "model", so I don't have to refit it when I use it? model itself has
> reference to the training data, so pickling will results in pickling
> training data. I can just save the parameters in the RegressionResult
> object ("results" in the code), but I can't build a new model based on
> the parameters, so I can't use the same model APIs, such as predict().
> OLS is easy to "predict" by ourselves, but for more fancy regression,
> this will be non-trivial.
>
> What will be the best way of preserving the model, so when I use it, I
> can just load it from the disk, and use it as if it was just trained?
> Thank you so much!

Unfortunately there is still no official way to do it.

We had the question once before, and I tried some workarounds or
possible solutions, but there is no general solution, yet.

One option is to delete (or assign None ? to) the training data
before pickling to make it smaller.
The second possibility is to create a new model instance and use it's
predict method.
The third possibility is to create separate objects that are just for
prediction.

The main problem or question is which information from the results and
the model instances is necessary for later usage.

I think, model.predict outside of tsa does not use any state (attached
attributes). model.predict should be a method that just works like a
function with params and exog as parameters. (maybe additional
arguments in some cases)

In this case, the first two options would work. Prediction intervals
(when available) would require some additional information.

forecasting in tsa is more difficult, because it needs historical
information to initialize the process. AR and VAR might still work but
I don't know if they are designed this way. ARMA needs the estimate of
past residuals.
I never tried whether this would work with the current implementation.

For ARMA and VAR we went partially into the third solution by having
an additional "Process" class. But this was not systematically
developed for prediction.

The main issue I have in thinking about a solution is what to throw
away and what to keep. Since models and the results are designed for
calculating results on demand, there is no clear way to tell what to
keep.

I don't think it would be too difficult, but we would have to define
the restricted use cases:

For statistical tests we essentially need the parameters, a covariance
and some degrees of freedom or nobs, k_vars information.

Prediction: I mentioned above prediction of the mean. For prediction
intervals, I think it depends on the model (In structural VAR with
bootstrap confidence intervals we cannot throw away anything, I
guess?). (possibly an issue: we still have no strict pattern whether
mean and confidence interval for prediction are produced by the same
method.)

One possibility to add this (without waiting for a general solution)
would be to add a method `shrink_for_predict` that could be called
before pickling and which throws away all the attached data that is
not used for prediction. (1st solution)
or 2nd solution: make functions that create fake models and results
instances that take the minimal information required for prediction.

Which models are you using for prediction, and which results do you
use with prediction?

I'm interested in getting a use case and see how well this would work.

Eventually we will need a proper solution, but I don't see yet how we
can change the models to work with only partial information.

Josef

Jieyun Fu

unread,

Mar 9, 2012, 3:09:02 PM3/9/12

to pystat...@googlegroups.com

Thanks for your reply. I am looking for a temporary (but somehow elegant) solution under the current framework.

- Just None-ing the training data wouldn't help much, because model at least has a reference to the "_results" object, which has the residuals stored. And we can't None out the "_results" object, because the model uses ._results.params to do prediction. :)

- We can save the params and supply "params" into the predict method of a new model instance. but building a new empty model is not possible unless we supply endog, exog when constructing a model. Putting dummy values in looks kind of weird, but that's the best solution so far, I guess.

- I believe creating a separate object for prediction is a long term solution that is not available to us yet.

I am mostly just using OLS and logistic regression. But I am hoping there can be a generic (even temporary) solution for all the models, since I might use them at any time. :)

Thanks again for your reply!

Jieyun

josef...@gmail.com

unread,

Mar 9, 2012, 3:32:20 PM3/9/12

to pystat...@googlegroups.com

This has changed in the version in development (master on github)
since _results created a circular reference we removed it. predict has
now a different signature

What I thought was not to assign None to _results, but to all
individual attached big arrays, endog, exog, wendog, wexog, resid,
wresid, fittedvalues and possible references (attributes) to those in
either model or result instance.
(But I haven't tried yet.)

>
> - We can save the params and supply "params" into the predict method of a
> new model instance. but building a new empty model is not possible unless we
> supply endog, exog when constructing a model. Putting dummy values in looks
> kind of weird, but that's the best solution so far, I guess.

That's what I tried out last time, and because I created them with
minimally sized endog and exog, the number of observations and degrees
of freedom were wrong. So, it's definitely dangerous and just a
stop-gap solution.

>
> - I believe creating a separate object for prediction is a long term
> solution that is not available to us yet.
>
> I am mostly just using OLS and logistic regression. But I am hoping there
> can be a generic (even temporary) solution for all the models, since I might
> use them at any time. :)

I will give it a try, I think it shouldn't be too difficult to get it
to work for the main models outside of tsa.

Josef

josef...@gmail.com

unread,

Mar 9, 2012, 4:47:17 PM3/9/12

to pystat...@googlegroups.com

I tried the basic idea of wiping data arrays, but it's a lot more
difficult than I expected.
reduces pickle size from 2300 KB to 2KB, see code below

I didn't call any cached properties before pickling, so none of them
are additionally created and stored, and they won't be available after
unpickling.

The problems with the statsmodels development version
1) the results instance is actually an instance of the wrapper class
that isn't picklable (at least not in my example)
-> pickle results._results (the unwrapped instance) instead
2) now models have an additional _data attribute that holds additional
references to the data

essentially, I just looked at the pickle file and chased down every
array with introspection, dir(...), and set it to None.
I'm not familiar enough with the wrapper machinery to understand or
guess some of the consequences of doing it this way.

It works for prediction in the example, but needs a lot more checking.

Josef

-------------

import numpy as np
import scikits.statsmodels.api as sm

nobs = 10000
np.random.seed(987689)
x = np.random.randn(nobs, 3)
x = sm.add_constant(x, prepend=True)
y = x.sum(1) + np.random.randn(nobs)

xf = 0.5 * np.ones((2,4))

model = sm.OLS(y, x)
results = model.fit()

print results.predict(xf)
print results.model.predict(results.params, xf)

results._results.model.endog = None
results._results.model.wendog = None
results._results.model.exog = None
results._results.model.wexog = None
results.model._data._orig_endog = None
results.model._data._orig_exog = None
results.model._data.endog = None
results.model._data.exog = None
#results.model._data = None

results._results.model.fittedvalues = None
results._results.model.resid = None
results._results.model.wresid = None
#extra
results._results.model.pinv_wexog = None

import pickle
fh = open('try_shrink2.pickle', 'w')
pickle.dump(results._results, fh) #pickling wrapper doesn't work
fh.close()
fh = open('try_shrink2.pickle', 'r')
results2 = pickle.load(fh)
fh.close()
print results2.predict(xf)
print results2.model.predict(results.params, xf)
-------------

josef...@gmail.com

unread,

Mar 10, 2012, 8:24:02 AM3/10/12

to pystat...@googlegroups.com

I tried out discrete.Poisson

In the master version a result instance currently doesn't pickle,
there is a branch where Skipper started to fix it
https://github.com/statsmodels/statsmodels/compare/master...pickle
https://github.com/statsmodels/statsmodels/issues/95

Using a work around
results._results.mle_settings['callback'] = None
before pickling, using the same setting data arrays to None works,
pickle file size went from 1300 to 3 KB.

So it looks like a `shring_to_predict` or `remove_data` (?) method
could be made to work.

I never looked at pickling closely enough to tell, why the callback
and some other parts don't pickle.

Josef

Jieyun Fu

unread,

Mar 10, 2012, 10:35:45 AM3/10/12

to pystat...@googlegroups.com

I guess the take-away here is that, preserving the trained-model is really model-specific. Users have to take care of it themselves.

I look very forward to having a generic solution!

josef...@gmail.com

unread,

Mar 10, 2012, 10:53:05 AM3/10/12

to pystat...@googlegroups.com

It's not so bad, to go from OLS to Poisson I only had to add one line
(but because of the pickling problems it took me some time to find
that line :))

Because many of the properties and attributes are inherited and for
others the naming is mostly consistent, most of the code will be the
same across models (outside of tsa). Some models have additional
attributes, Poisson has offset and exposure that I didn't use in my
example. GLM has some additional residuals arrays that I don't
remember when they are created.

If we add a method for this, then most of the code (setting attributes
to None) can be inherited, but some models might need a few extra
lines (and maybe we need to check whether a model is not missing some
arrays).

For users it would be just a call in this case

results.shrink_to_predict() or
results.drop_data()

I'm not sure what's a good name.

Josef

josef...@gmail.com

unread,

Mar 10, 2012, 2:54:28 PM3/10/12

to pystat...@googlegroups.com

I opened an issue https://github.com/statsmodels/statsmodels/issues/176
and started a branch to work on this
https://github.com/josef-pkt/statsmodels/commits/remove_data

I added a method `remove_data` to the Results in linear_model and in
discrete_model

https://github.com/josef-pkt/statsmodels/commit/b3e33035699c8c5a7d8e5bc8290cf6800740c1af

Jieyun, can you look at the example and see if this would work for you?

It will still take some work to get this into a reasonably general and
robust form.

Josef

Jieyun Fu

unread,

Mar 10, 2012, 8:25:17 PM3/10/12

to pystat...@googlegroups.com

Thanks Josef. It would work for me. However, does it make more sense to just have a "preserve()" method and "load()" method, instead of remove_data()? This makes the APIs a lot cleaner. I don't see a reason of removing the data of a fitted model other than preserving it.

josef...@gmail.com

unread,

Mar 10, 2012, 9:07:22 PM3/10/12

to pystat...@googlegroups.com

I'm not sure what you mean with preserve.

Skipper has in his pickle branch a save() and a load() method, that
would do the pickling. (And with tests we would be sure not to break
it again, or keep track of which parts are not picklable.)

remove_data could be an option in the save() or save_pickle() method.
So if this works out, a user wouldn't need to call remove_data
himself. But I think save would need both options, pickle full and
pickle shrinked instances.

But we need a method internally or publicly that knows how to remove
the data from a result and model instance, so it would be partially
independent from the user interface.

My latest version is more generic now, and can be inherited.

I'm now also setting to None those arrays that are created, cached
attributes like fittedvalues, resid, wresid. They are for example
created by a call to summary()

They are not necessary for predict. But I don't know if a user
wouldn't want to keep them anyway.

Right now, I still keep all information about the fit statistics that
have been created already, before calling remove_data, the largest of
those would be parameter covariance matrix with shape (k_vars,
k_vars). If the number of exog variables is not huge, then this
doesn't require much space.

-----------
I think some other use cases will come up that might take advantage of
remove_data (maybe with different options).

For example, I'm running many regression in a loop to get statistics
from leave-one-out regression. similar case is bootstrap.

Currently, I collect the relevant results into a list because it would
waste a lot of memory to keep full instances around.
If instances are shrunk, then I could keep them around and they would
be available for follow up analysis that only requires the stored
parameters and cov_params.

Josef

Jieyun Fu

unread,

Mar 11, 2012, 8:29:03 PM3/11/12

to pystat...@googlegroups.com

Wasn't aware that save() and load() were already in the plan. As long as remove_data is an option in those methods, I am fine with it. Thanks again Josef.

josef...@gmail.com

unread,

Mar 12, 2012, 11:17:28 PM3/12/12

to pystat...@googlegroups.com

remove_data is mostly finished
https://github.com/statsmodels/statsmodels/pull/178

still needs to coordinate with pickle branch and a dump/save/pickle method.

Skipper, do you want to check the implementation?
it's pretty spread out, but the main information what to delete are
in lists of attribute names in model and results. __init__. That can
be easily changed, even just before calling remove_data. Some special
cases are (still) hard wired.

Josef

Nathaniel Smith

unread,

Mar 13, 2012, 8:46:43 AM3/13/12

to pystat...@googlegroups.com

On Tue, Mar 13, 2012 at 3:17 AM, <josef...@gmail.com> wrote:
> remove_data is mostly finished
> https://github.com/statsmodels/statsmodels/pull/178
>
> still needs to coordinate with pickle branch and a dump/save/pickle method.
>
> Skipper, do you want to check the implementation?
> it's pretty spread out, but the main information what to delete are
> in lists of attribute names in model and results. __init__. That can
> be easily changed, even just before calling remove_data. Some special
> cases are (still) hard wired.

This is an awesome addition; I've run into exactly this problem plenty
of times in R when bootstrapping. Thanks!

The .remove_data() approach sort of... smells funny[0] to me, though.
We end up with an object that can be in two rather different states,
with a bunch of fields and attributes that are either there or not,
which sounds really annoying maintenance-wise. And the consequences
for the user are pretty obscure. ("Is this a real results object or
not? Which methods will actually work?") Maybe it'd be worth thinking
a bit about the modeling (in the program design sense) here?

[0] http://c2.com/xp/CodeSmell.html

I feel like there are a few logically distinct objects that are mixed
up together. Maybe the right schema is something like:
-- "The problem description" -- basically what you get now when you
call sm.OLS() or analogues. I'm not sure what the name for this is. A
description of the data, the fitting method to be used, etc.
-- "The model" -- in the sense that stats textbooks mean the term.
This is new. Coefficients and basis functions and that sort of thing,
but only that. Enough to compute likelihoods and predictions on new
data, but it doesn't care where the model came from (whether it was
fit by ML, or by MM, or just typed in).
-- "The fit results" -- An object that wraps up all the information
from actually running a fit. Has references to both of the above
objects, and is responsible for any operations that require mixing
both of them. E.g., this is where you have your rss and t values, and
this object has a predict() method that delegates to the model's
predict method, but adds the data from the problem description as a
default argument.

Currently the results object is a amalgamation of the latter two
things. If they were separated, then instead of result.remove_data();
pickle(result) people could just say pickle(result.model), and
everything's clearer conceptually.

What do you think, would that work/make sense/be better?

-- Nathaniel

josef...@gmail.com

unread,

Mar 13, 2012, 10:29:57 AM3/13/12

to pystat...@googlegroups.com

On Tue, Mar 13, 2012 at 8:46 AM, Nathaniel Smith <n...@pobox.com> wrote:
> On Tue, Mar 13, 2012 at 3:17 AM, <josef...@gmail.com> wrote:
>> remove_data is mostly finished
>> https://github.com/statsmodels/statsmodels/pull/178
>>
>> still needs to coordinate with pickle branch and a dump/save/pickle method.
>>
>> Skipper, do you want to check the implementation?
>> it's pretty spread out, but the main information what to delete are
>> in lists of attribute names in model and results. __init__. That can
>> be easily changed, even just before calling remove_data. Some special
>> cases are (still) hard wired.
>
> This is an awesome addition; I've run into exactly this problem plenty
> of times in R when bootstrapping. Thanks!
>
> The .remove_data() approach sort of... smells funny[0] to me, though.
> We end up with an object that can be in two rather different states,
> with a bunch of fields and attributes that are either there or not,
> which sounds really annoying maintenance-wise. And the consequences
> for the user are pretty obscure. ("Is this a real results object or
> not? Which methods will actually work?") Maybe it'd be worth thinking
> a bit about the modeling (in the program design sense) here?
>
> [0] http://c2.com/xp/CodeSmell.html

It's a bit smelly, and not like Camembert

The advantages are that I could code it in a day or two, exceptions
are kind of obvious (I don't think it can produce wrong numbers), and
it could be extended to custom stripping

>>> res
<statsmodels.discrete.discrete_model.CountResults object at 0x04884730>
>>> res.fittedvalues
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-josef_new\statsmodels\tools\decorators.py",
line 85, in __get__
_cachedval = self.fget(obj)
File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-josef_new\statsmodels\discrete\discrete_model.py",
line 1530, in fittedvalues
return np.dot(self.model.exog, self.params)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

looping over attributes [i for i in dir(results)] raises an exception

I would consider it only for restricted use with a specific purpose in
mind. Similar, when running a bootstrap or MonteCarlo we have to
decide in advance what to keep and what to throw away, unless we want
to blow our memory or work on tiny problems.

>
> I feel like there are a few logically distinct objects that are mixed
> up together. Maybe the right schema is something like:
> -- "The problem description" -- basically what you get now when you
> call sm.OLS() or analogues. I'm not sure what the name for this is. A
> description of the data, the fitting method to be used, etc.
> -- "The model" -- in the sense that stats textbooks mean the term.
> This is new. Coefficients and basis functions and that sort of thing,
> but only that. Enough to compute likelihoods and predictions on new
> data, but it doesn't care where the model came from (whether it was
> fit by ML, or by MM, or just typed in).
> -- "The fit results" -- An object that wraps up all the information
> from actually running a fit. Has references to both of the above
> objects, and is responsible for any operations that require mixing
> both of them. E.g., this is where you have your rss and t values, and
> this object has a predict() method that delegates to the model's
> predict method, but adds the data from the problem description as a
> default argument.
>
> Currently the results object is a amalgamation of the latter two
> things. If they were separated, then instead of result.remove_data();
> pickle(result) people could just say pickle(result.model), and
> everything's clearer conceptually.
>
> What do you think, would that work/make sense/be better?

I think a refactoring or enhancements along those line would be the
better longterm solution.
We briefly discussed variations on this on the mailinglist a few times.

The main problem is to figure out what is supposed to be in your "The
model", and where the boundaries are especially with "The problem
description".
Also if you want to predict (with standard errors), then again we need
the "The fit results" which is fully loaded.

I think it can be done and I would like to see more outsourcing or
mixins, however this requires some experimentation with the design and
refactoring that nobody has done yet.

examples:
For ARMA I wrote a process class, that is used only as helper function
and for generating random processes. VAR is split up into several
classes (but I'm not familiar with all the boundaries).

It would be nice to have a NormalLinearModel as a mixin or standalone
functions that just define likelihood and predict, and ??? (but the
equivalent to OLS is just a few lines)

For models with non-normal distribution (GLM, discrete), I would like
to have a `generate_sample` or `rvs` method to quickly get some test
example or for Monte Carlo, or as variation for parametric bootstrap
I wrote random sample for the tests for Poisson and Logit, but
couldn't make up my mind yet whether and where it should be attached
as a method.

Also for the non-normal distribution, I would find it useful to get
out a distribution instance, so we can use the full distribution for
prediction,
e.g. predict_distr = lambda xf; scipy.stats.poisson(result.predict(xf))
but we would need to write distribution classes for MNLogit and Probit
for example, I think.

The way we could move in this direction is to start with one model
class, e.g. discrete_model, and see how we can refactor the class
structure to make different specialized use cases easier, without
getting into the smelly stuff.

Thanks for the feedback,

Josef

>
> -- Nathaniel

Nathaniel Smith

unread,

Mar 15, 2012, 5:08:43 PM3/15/12

to pystat...@googlegroups.com

I'm not sure what you mean here -- perhaps for certain non-parametric
models (e.g., locally weighted regression or kernel density estimates)
you need to keep the original data, but that's fine, that's just the
nature of those models. They should stash the actual data in the
"model" object (or if we want to really get fancy then for a lot of
cases some numerical summary might be sufficient -- e.g., evaluating
your kernel density on a grid and saving that may let you reduce
memory requirements by orders of magnitude without any real loss in
precision. But this is a tangent...).

What are you thinking of? For basic regression setups, you don't need
the original data to make predictions on new data. For something like
ARMA you need some context, but you can use any given ARMA model to
make predictions from arbitrary context, not just the original data.
So again it makes sense for the model to just have the ARMA
coefficients and to have a predict method that requires context be
provided, and the results object can fill in the original data by
default.

> I think it can be done and I would like to see more outsourcing or
> mixins, however this requires some experimentation with the design and
> refactoring that nobody has done yet.
>
> examples:
> For ARMA I wrote a process class, that is used only as helper function
> and for generating random processes. VAR is split up into several
> classes (but I'm not familiar with all the boundaries).
>
> It would be nice to have a NormalLinearModel as a mixin or standalone
> functions that just define likelihood and predict, and ??? (but the
> equivalent to OLS is just a few lines)

To tell the truth, I have a bias against using inheritance in general,
including (maybe especially) mixins... IME every single time I've ever
considered using inheritance, it turned out that delegation worked
better. I'm sure there must be some exceptions to this rule, or
academic language designers wouldn't make such a fuss about
inheritance... OTOH this may be overly optimistic on my part :-).

Inheritance is the literal equivalent to doing "import *". It creates
very strong coupling between code in different places, and breaks the
interface boundaries that make maintenance possible. Looking at some
code in a superclass and trying to figure out how a change will affect
all the subclasses makes my head hurt. So if I need similar
functionality in multiple classes, I just call helper functions or
delegate to a utility object. YMMV of course.

To be clear in this case I'm suggesting that a "results" object HAS-A
"model" object, not IS-A "model" object.

> For models with non-normal distribution (GLM, discrete), I would like
> to have a `generate_sample` or `rvs` method to quickly get some test
> example or for Monte Carlo, or as variation for parametric bootstrap
> I wrote random sample for the tests for Poisson and Logit, but
> couldn't make up my mind yet whether and where it should be attached
> as a method.

This would be fantastic. A similar feature would be the ability to
compute (log-)likelihoods for arbitrary new data. E.g., the Vuong test
lets you do non-nested model comparison for misspecifed models (!),
but you need to be able to get the likelihood of each individual data
point, which is generally impossible for R models... (R models have a
standard method to get the total log-likelihood for all of the
original data, which is enough to compute things like AIC but doesn't
help here.)

> Also for the non-normal distribution, I would find it useful to get
> out a distribution instance, so we can use the full distribution for
> prediction,
> e.g. predict_distr = lambda xf; scipy.stats.poisson(result.predict(xf))
> but we would need to write distribution classes for MNLogit and Probit
> for example, I think.
>
> The way we could move in this direction is to start with one model
> class, e.g. discrete_model, and see how we can refactor the class
> structure to make different specialized use cases easier, without
> getting into the smelly stuff.

Yeah, that'd definitely be better than my armchair architecting :-).
But, wanted to throw the idea out there...

-- Nathaniel

josef...@gmail.com

unread,

Mar 15, 2012, 10:38:45 PM3/15/12

to pystat...@googlegroups.com

The main problem I'm thinking of is lazy evaluation if we don't know
yet what might need later on.

For any specific usecase, I can (try to) figure out what the
corresponding sufficient statistics are, but for a general purpose
model/results structure, we would either need to calculate everything
before we throw away the data or we would be left with a reduced,
special purpose class.
In the current case I'm throwing away everything that is not needed
for prediction, calculate expected value for new exog.
For inference on parameters I could also throw away everything except
parameter estimates and covariance matrix and degrees of freedom in
the case of t distribution. But if I want to run some extra diagnostic
tests, I need the data again.

Specifically for ARMA, it's possible to use the arima process class
that only uses the parameters for the lag-polynomials. But if you want
bootstrap standard errors, for example for the impulse response
function as in structural VAR, then you are back to wanting the data.

The current "hold on to everything" is convenient, but if there are
well specified usecases with smaller requirements it would be possible
to specialize on sufficient statistics.

>
>> I think it can be done and I would like to see more outsourcing or
>> mixins, however this requires some experimentation with the design and
>> refactoring that nobody has done yet.
>>
>> examples:
>> For ARMA I wrote a process class, that is used only as helper function
>> and for generating random processes. VAR is split up into several
>> classes (but I'm not familiar with all the boundaries).
>>
>> It would be nice to have a NormalLinearModel as a mixin or standalone
>> functions that just define likelihood and predict, and ??? (but the
>> equivalent to OLS is just a few lines)
>
> To tell the truth, I have a bias against using inheritance in general,
> including (maybe especially) mixins... IME every single time I've ever
> considered using inheritance, it turned out that delegation worked
> better. I'm sure there must be some exceptions to this rule, or
> academic language designers wouldn't make such a fuss about
> inheritance... OTOH this may be overly optimistic on my part :-).
>
> Inheritance is the literal equivalent to doing "import *". It creates
> very strong coupling between code in different places, and breaks the
> interface boundaries that make maintenance possible. Looking at some
> code in a superclass and trying to figure out how a change will affect
> all the subclasses makes my head hurt. So if I need similar
> functionality in multiple classes, I just call helper functions or
> delegate to a utility object. YMMV of course.

I think we are benefitting a lot from our inheritance structure, but
we are not using multiple inheritance (yet) and mix-ins only in 1 or 2
non-central cases.
Our class inheritance is essentially a general to special tree,
likelihoodmodel, discretemodel, binomialmodel, [Logit, Probit] (or
something like this)

Adding code in a superclass makes the enhancement immediately
available in subclasses, but it's not always easy to decide where
something should go or to find it again.

>
> To be clear in this case I'm suggesting that a "results" object HAS-A
> "model" object, not IS-A "model" object.

Our result instances "have" models, but very tightly coupled. But if
you just want inference on parameters, you could do result.model=None,
(except for GLM where the result instance has aliases to the endog and
exog)
But you cannot delegate if your agent has gone MIA or AWOL.

>
>> For models with non-normal distribution (GLM, discrete), I would like
>> to have a `generate_sample` or `rvs` method to quickly get some test
>> example or for Monte Carlo, or as variation for parametric bootstrap
>> I wrote random sample for the tests for Poisson and Logit, but
>> couldn't make up my mind yet whether and where it should be attached
>> as a method.
>
> This would be fantastic. A similar feature would be the ability to
> compute (log-)likelihoods for arbitrary new data. E.g., the Vuong test
> lets you do non-nested model comparison for misspecifed models (!),
> but you need to be able to get the likelihood of each individual data
> point, which is generally impossible for R models... (R models have a
> standard method to get the total log-likelihood for all of the
> original data, which is enough to compute things like AIC but doesn't
> help here.)

To get the likelihood for new data, we need to create a new instance,
but I'm all in favor of having log-likelihood (and derivative) for
individual observations (of training data). I just added them to
discrete_models (and most of my code defines loglikeobs). (target
White's specification test based on different covariance estimates, or
Huber robust standard errors).

I didn't look at the details of Vuong's test yet. (I added two other
non-nested tests but only for linear model).
I didn't come across calculating the likelihood for new, out of sample
observations. The current compare_ftest and compare_lr for nested
models takes a second model instance to test against.

>
>> Also for the non-normal distribution, I would find it useful to get
>> out a distribution instance, so we can use the full distribution for
>> prediction,
>> e.g. predict_distr = lambda xf; scipy.stats.poisson(result.predict(xf))
>> but we would need to write distribution classes for MNLogit and Probit
>> for example, I think.
>>
>> The way we could move in this direction is to start with one model
>> class, e.g. discrete_model, and see how we can refactor the class
>> structure to make different specialized use cases easier, without
>> getting into the smelly stuff.
>
> Yeah, that'd definitely be better than my armchair architecting :-).
> But, wanted to throw the idea out there...

It's sometimes good to hear about a view from further away,
"you can see the trees or the forest but it's difficult to see them at
the same time" (or whichever way the saying goes)

Thanks,
Josef

>
> -- Nathaniel

Skipper Seabold

unread,

Mar 16, 2012, 10:16:53 PM3/16/12

to pystat...@googlegroups.com

Thinking a bit on this. What do we have, what are the goals, what are
the given use cases. Right now I hear this, which I hope adds some
concreteness and distinct separation to ideas already put forward. I
don't think there's anything new here.

Model : data and its (assumed) generating process
- it has methods to produce a fitted results object
- this allows us to assess the the appropriateness of our
DGP assumptions or to do prediction
- it has methods for prior predictive analysis (generate_rvs) (we're
not there yet, but predict is a start)
- to answer the question could our model assumptions
generate our observed data (more Bayesian)
- could be used for pedagogical reasons / pure statistical reasons
(RNG)

This model informs (at least) two distinct object (that may need to
share information)

Results : this is for diagnostics / model checking. This is a fitted model.
- residuals
- test statistics (on residuals / fitted values)
- parameters

Predictor : this is for prediction / model usage.
- parameters and a sense of their uncertainty
(standard errors/var-covariance and a distribution)
- (for, say ARMA, we'll also need the model order which is a DGP
assumption, for better or worse)
- what else do you want this for stripped down class for?

So, while I'm fine with remove_data or whatever we have now, a
long-term solution, I think is a get_predictor method of Results that
returns a pickleable Predictor class for a given model. Delegation or
inheritance, though the former sounds right to me at the moment. The
predictor class will need to keep the model formula (for
transformation) and the results given above unless I've missed
anything. I think this might clean up some of the smell. We won't
"None out" any of the data, we just won't attach to the Predictor
object. And we won't have zombie methods attached to an object in a
different state than is usual/"expected".

Aside, I think saving results from a bootstrap is a separate issue.
The commonality is, you want a stripped down results object. But what
you want to do with it is different - you're still assessing goodness
of fit / measuring uncertainty. That's for another discussion but I
don't think the solution would need to be that different.

> Specifically for ARMA, it's possible to use the arima process class
> that only uses the parameters for the lag-polynomials. But if you want
> bootstrap standard errors, for example for the impulse response
> function as in structural VAR, then you are back to wanting the data.
>

I still don't think we've thought out well what the separation is
between Process and Model for the TSA case. Given my distinctions
above, there isn't one and the process should maybe be part of the
model. You can either explore or fit with the same object. But this
raises the point that we should be able to instantiate models without
data, which I've argued against in the past. I still need to think
more on this. Maybe inheritance is right here with results delegating
some to Process (eg., invertibility/stability of roots).

My $.02,

Skipper

josef...@gmail.com

unread,

Mar 16, 2012, 11:51:52 PM3/16/12

to pystat...@googlegroups.com

The problem is a bit like tags versus directories (gmail compared to
my desktop email program)
We have a few big directories to store things, while what we would
need is a tag system that categorizes methods by purpose and usage.

some pieces

In terms of DGP process-model, we don't have so many yet,
linear model : GLS, RLM, GLM
various discrete with common models in discrete_model and GLM
DGPs in tsa: AR, ARMA, VAR
and some variation
(maybe split between mean process and variance process, or combined,
multistage processes)

estimators with data
least squares
MLE
iterative MLE in GLM
robust iterative (RLM)
tsa combines estimators in one model
more ... (GMM/GEE)

full results for exploration (current result classes)

stripped results for single purpose (like prediction, tests on parameters)

and then we have or need metainformation: varnames and wrappers in
_data, and formula (what's the interpretation of the design matrix)
---

prediction looks like it only needs the process-model (including
predict method) and stripped results for uncertainty

I think splitting the process-model from the estimation-model might
help in separating out prediction, generate_random_sample and allow
reuse by mixing and matching pieces of process and estimators.
GLM sounds a bit this way with lots of model specific information
delegated to the families.

But when I think about implementing it, I'm getting a bit doubtful
again. Deciding on boundaries might not be easy. And delegation will
in many cases lengthen the signature of any methods or functions. The
process-model doesn't have data but should know how to operate on it.
The estimator is just an __init__ and a fit method. ?
(And everything is much more complicated in tsa)

A smep and some worked out examples before any big shifting/refactoring ?

Josef

>
> My $.02,
>
> Skipper

Reply all

Reply to author

Forward