Same model different data?

Matthew Brett

unread,

Mar 22, 2011, 8:54:34 PM3/22/11

to pystatsmodels

Hi,

We're just trying to port our stuff to use pystatsmodels instead of
our old version of scipy.stats.models.

One thing we need to do is to fit the same model for many Ys.

This makes us want to set the response variable (exog) after
initializing the model.

However, it looks like we can't do that, at least:

M = OLS(X)

gives an error as it tries to whiten None:

167 self.wendog = self.whiten(self.endog)
168 # overwrite nobs from class Model:

--> 169 self.nobs = float(self.wexog.shape[0])
170 self.df_resid = self.nobs - rank(self.exog)
171 # Below assumes that we have a constant

AttributeError: 'NoneType' object has no attribute 'shape'

Is there any way of attaching the exog after creating the model?

Can exog be a matrix instead of a vector - i.e multivariate regression, with

Y = X . B + E

and Y is shape (n, q) X is (n, p), B is (p, q) E is (n, q).

We just need to be able to fit models like this, we don't need all the
classical tests for these results, though we will want to
be able to compute q F or t-statistics based on a contrast matrix C of
shape (k,p) (if k=1, this results in q T-statistics, if k > 1, they
will be F-statistics).

Our application is brain imaging where q is >> 1000

Thanks a lot,

Matthew, Jonathan

josef...@gmail.com

unread,

Mar 22, 2011, 9:55:54 PM3/22/11

to pystat...@googlegroups.com

On Tue, Mar 22, 2011 at 7:54 PM, Matthew Brett <matthe...@gmail.com> wrote:
> Hi,
>
> We're just trying to port our stuff to use pystatsmodels instead of
> our old version of scipy.stats.models.
>
> One thing we need to do is to fit the same model for many Ys.
>
> This makes us want to set the response variable (exog) after
> initializing the model.
>
> However, it looks like we can't do that, at least:
>
> M = OLS(X)
>
> gives an error as it tries to whiten None:
>
> 167 self.wendog = self.whiten(self.endog)
> 168 # overwrite nobs from class Model:
>
> --> 169 self.nobs = float(self.wexog.shape[0])
> 170 self.df_resid = self.nobs - rank(self.exog)
> 171 # Below assumes that we have a constant
>
>
> AttributeError: 'NoneType' object has no attribute 'shape'
>
> Is there any way of attaching the exog after creating the model?

No, not possible. All current models require the full data set in the init.

Which version are you using? For some cases exog is optional, and then
it can be None, but I'm not sure in which cases this applies.

>
> Can exog be a matrix instead of a vector - i.e multivariate regression, with
>
> Y = X . B + E
>
> and Y is shape (n, q) X is (n, p), B is (p, q) E is (n, q).
>
> We just need to be able to fit models like this, we don't need all the
> classical tests for these results, though we will want to
> be able to compute q F or t-statistics based on a contrast matrix C of
> shape (k,p) (if k=1, this results in q T-statistics, if k > 1, they
> will be F-statistics).

I just found out by accident that OLS with a 2d Y works to get params,
but it will break soon later.

If you can fit all Y into memory, so that you can do one big
multidimensional regression, then this is a case that we have only
partially covered, VAR and SUR (vector autoregressive and seemingly
unrelated regression have this structure) follow this pattern, but SUR
is still in the sandbox.
This would be the most efficient way, but I'm not sure about the
details. Do you want correlation across different y? I think with SUR
would be q*q square, but the parameter estimates would be independent
of this.

Another way that might work is to us your own model class that holds
on to the design matrix, but returns a standard regression result.
Almost all the result statistic, f_test, t_test and so on, are in the
results class that you would still get.

Essentially, since nobody mentioned this case (reusing the design
matrix) since statsmodels became its own package, we haven't thought
more about this.

>
> Our application is brain imaging where q is >> 1000

how large is n?

Josef

Wes McKinney

unread,

Mar 22, 2011, 10:30:28 PM3/22/11

to pystat...@googlegroups.com

Note that scipy.stats.models worked something like:

model = OLS(X)
result = model.fit(Y)

I'm curious if we couldn't emulate this functionality, since there are
many things which can be reused (e.g. normalized_cov_params)

Jonathan Taylor

unread,

Mar 22, 2011, 11:16:08 PM3/22/11

to pystat...@googlegroups.com, Wes McKinney

model = OLS(X)
result = model.fit(Y)

I'm curious if we couldn't emulate this functionality, since there are
many things which can be reused (e.g. normalized_cov_params)

That's exactly what I had in mind when I wrote scipy.stats.models -- knowing normalized_cov_params, I can
compute a different residual sum of squares for each of the q models, then efficiently compute all q t-statistics.

I think that the original code to compute F-statistics also was written to handle the computation of several F-statistics.
Specifically, if we had an estimate of the parameters B with shape (p,q) and a contrast matrix C (k,p), then the numerator of the
F statistic is something like

n, p = X.shape
pinvX = np.linalg.pinv(X)
B = np.dot(pinvX, Y)
SSE = np.sum((Y - np.dot(X, B))**2, 0)
XTXinv = np.dot(pinvX.T, pinvX)
covC_inv = np.linalg.inv(C, np.dot(XTXinv, C.T)
tmp1 = np.dot(C, B) # shape = (k,q)
tmp2 = np.dot(covC_inv, tmp1) # shape = (k,q)
numerator = (tmp1 * tmp2).sum(0) # shape = (q,)

F = (numerator / k) / (SSE / (n - p)

-- Jonathan

--

Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel: 650.723.9230
Fax: 650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo

josef...@gmail.com

unread,

Mar 22, 2011, 11:26:14 PM3/22/11

to pystat...@googlegroups.com

My first impression: too messy.
We very briefly discussed an "update" method a long time ago.

We currently have more use cases where we keep the same endog, y, but
change exog, X, internally, e.g. for model selection. And it's already
sometimes difficult to keep track of it.

The big advantage in the current design is that we don't have to worry
much about wiping any cache, intermediate results correctly, If we
want to fit a different model then we create new instances of models
and results. This makes reuse more difficult, on the other hand we
don't have to worry about stale data from a previous estimation.
(My favorite again, scipy.stats.distribution, reusing instances
instead of creating new instances. - what a pain, it took me months to
figure out where the "persistent" state spillovers were coming from.)

It's possible, but I would create new classes or new methods for it to
keep the "init - fit -result" path simple and understandable.
(Similar, we don't have any structure for online or incremental
fitting, which I guess should also be separate classes, although we do
have dynamic in vector_ar.)

Since it is a big conceptional change, I would like to see if the use
of this is large enough across models to change the existing
structure. (OLS is just a special case.)
But I don't think it would be difficult to add a DeferredOLS class and
a MultiEndogOLS (ugly name). The latter might just be a simplified
version of SUR, and would be computationally more efficient than
looping through q>1000 endogs.

Josef
(I got my new notebook, and still have problems using it, or finding things)

josef...@gmail.com

unread,

Mar 23, 2011, 12:05:39 AM3/23/11

to pystat...@googlegroups.com

In this case, *should* F be a scalar or a vector, for independent
F-tests it should be a vector, but if this is interpreted as a
cross-equation restriction in a system of equation model, then it
should be a scalar.

Currently, t_test allows for multiple hypothesis, but the f_test is
always interpreted as a single joint hypothesis, although I'm not
always sure what the results are if unexpected dimensions are fed to a
function.

Your example is the basis of a MultipleOLS class, but I don't think it
will fit in the current regression (OLS to GLS) hierarchy, which
assume that there is a single model. Or maybe there would be
equivalent methods, maybe defining a sandwich estimator for this case,
for example, wouldn't be impossible. loglikelihood I guess would be a
vector of independent components, and so on.

Josef

josef...@gmail.com

unread,

Mar 23, 2011, 12:31:01 AM3/23/11

to pystat...@googlegroups.com

June 19, 2009 (nipy-devel), but we have come pretty far since then.

josef...@gmail.com

unread,

Mar 23, 2011, 10:58:35 AM3/23/11

to pystat...@googlegroups.com

Two more comments:

our top level LikelihoodModelResults class has a relatively clean
interpretation as either a multivariate Normal or multivariate T
distribution for a random vector with mean params, variance
cov_params, and in the case of the t-distribution the corresponding
degrees of freedom. Pretty much everything is designed for this case.
vector_ar.VAR doesn't inherit from it but has a similar structure,
vectorized (raveled) params and the corresponding 2d cov_params
matrix.

From what I can see the regression classes GLS, WLS, OLS, attach the
data but the only regression results are attached as results.instance.
So I think something like the following should work

mod = OLS(y[:,0], exog)

results = []
for idx in arange(y.shape[1]):
endog = y[:,idx]
mod.wendog = mod.whiten(endog)
res = mod.fit()
results.append((res.params, res.tvalues, res.f_test(C)))

As I wrote this I realized that I don't think I can do
`results.append(res)` in the loop

Currently we don't make any copies of any arrays, and the result
instance is holding on to the model instance to get additional (model
specific) methods and the data.
If we change the data, e.g. endog y, of a model instance, then the
data for the result instance will be based on the new (wrong) data.
So, also all our result classes rely on the "each different model is
a different instance" assumption.
The main design choices for this were no copying and lazy evaluation.

The only safe way to reuse an existing design matrix is by creating a
new model instance that explicitly takes a reference to an existing
design matrix, pinv(x), ...

Josef

Jonathan Taylor

unread,

Mar 23, 2011, 12:39:07 PM3/23/11

to pystat...@googlegroups.com, josef...@gmail.com

Yes, it is a bit messy to keep track of state changes. I think we can change where we use that code to pass in the data
on init. Besides changing X (I forget if you call this endog or exog) for, say, parametric or residual boostrapping, one would
also want to change Y... One would not want to have to recompute the pseudoinverse each time.

The multiple F statistics should be thought of as q different hypotheses, with an F-statistic for each hypothesis.
I agree that it may not be the most common occurence but there are several instances in "high-throughput" data where
one wants to compute an F-statistic for each column of a response data matrix Y (shape (n,q)). That little snippet of code
I wrote can almost be implemented directly on the LikelihoodModelResults as long as we can fit the model
with an (n,q) data matrix.

The model with an (n,q) data matrix and an (n,p) design matrix is known in statistics as a multivariate regression model (which admittedly sounds an awful lot like a multiple linear regression model). See Chapter 6 in http://www.amazon.com/Multivariate-Analysis-Probability-Mathematical-Statistics/dp/0124712525 (note they switch the p and q in the shapes...)

Jonathan

Matthew Brett

unread,

Mar 23, 2011, 5:41:52 PM3/23/11

to pystat...@googlegroups.com, josef...@gmail.com

n would commonly be > 100K, and can be much more than this.

See you,

Matthew

Wes McKinney

unread,

Mar 23, 2011, 10:46:07 PM3/23/11

to pystat...@googlegroups.com

I see no particular reason why we can't support both APIs. What if we
make endog optional and allow it to be passed into fit() (and if you
do pass it in, it defaults to that value when you call fit)? May seem
hackish but the alternative is making the library unusable for certain
kinds of users (e.g. nipy folks). Can it be done in a backwards
compatible way?

josef...@gmail.com

unread,

Mar 24, 2011, 12:26:39 AM3/24/11

to pystat...@googlegroups.com

Because as I tried to explain I don't think it's worth the pain
(increased complexity of model and result structure) and most likely
it only applies to OLS, or multivariate linear model.
Generalized and robust linear model use iterated least squares, which
creates a new model instance each iteration, WLS, GLS has to do
reweighting in most cases, and I doubt it will gain much or anything
in the Maximum Likelihood estimators that use non-linear optimization.
This is my impression and I don't know how far it is correct, or what
the performance gain would be.

As you have seen in VAR, one big, multivariate y, regression is a lot
more efficient (replacing the loop) than speeding up the loop a bit.

When we need high-throughput code I'm all in favor of adding
specialized classes that can use tricks that only apply for these
cases, but I wouldn't want to make the general purpose models more
complicated because of this.
My emphasis on writing reusable code through separate functions,
inheritance and mixins, is exactly so that it becomes easy to write a
new class or model and inherit the "extras" so that only some core
parts have to be rewritten.

Getting some reusable parts from VAR, SUR, multivariateOLS and
factormodels into multivariate linear model and results classes will
be very useful.

If there are enough use cases, we could create a "clone" method that
creates a *new* model instance based on existing processed X, instead
of just overloading the fit with this for a few models. Nothing is
impossible but I would like to avoid as many headaches from trying to
understand the code as possible.

As another example (I had written the next earlier but didn't want to
overload this thread too much)

just to add one more similar idea.
2/16/2010 Nathaniel offered his code for incremental least squares in scipy-dev.
MDP is piping large data to collect statistics on it.

One approach that would be useful, is to write models based not on the
original data, but based on some sufficient statistic, x'x, x'y, or
pinv(x).
For most result statistics this would be relatively easy, and there
are just a few cases where either the data needs to be obtained in a
second run through the data, or where the results would not be
available without the original data.

And one more example of a headache:
For generic maximum likelihood estimation I wanted to get the profile
loglikelihood for example for confidence intervals for non-linear
parameters. But I gave up because I couldn't think my way through
doing it within the existing models and model hierarchy. The main
point is that it requires repeated fit with different constraints
which means part of the loglikelihood method has to be replaced on the
fly. I was trying to do this all on the same (self) instance and I
just couldn't keep track of what is the original model and what were
the various versions of the model that I had internally rewritten.
When I go back to this it will be either with a clean design or
because I learned some new tricks in the meantime.

Josef

Alan G Isaac

unread,

Mar 24, 2011, 7:48:34 AM3/24/11

to pystat...@googlegroups.com

On 3/23/2011 10:46 PM, Wes McKinney wrote:
> I see no particular reason why we can't support both APIs.

There is a conceptual and practical reason:
exogs without an endog do not constitute a model.
A model for example can produce results.

I have been following this discussion only intermittently
and superficially, so naturally my comment
might miss the current core issues. But I'd like to
recall that this was discussed quite intensively two
years ago, and there were good reasons for the decisions.
The current "limitation" (if it is one) was a design
**decision**, not an oversight.

Probably because I have not been following this discussion
closely, it has looked rather like a solution in search
of a problem. Can someone give a clear statement of the
perceived problem and a list of possible solutions?

E.g., if the issue is access to the transformed exogs,
then that is easy enough. Or, would the issue be fully
addressed if a matrix of endogs were fully handled?

Alan Isaac

Matthew Brett

unread,

Mar 24, 2011, 4:18:36 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Hi,

On Thu, Mar 24, 2011 at 4:48 AM, Alan G Isaac <ais...@american.edu> wrote:
> On 3/23/2011 10:46 PM, Wes McKinney wrote:
>>
>> I see no particular reason why we can't support both APIs.
>
>
> There is a conceptual and practical reason:
> exogs without an endog do not constitute a model.
> A model for example can produce results.

No, I think that's wrong at least in the usual sense of - say - R.
The model is the design, you fit the model to the data, and get
results.

> I have been following this discussion only intermittently
> and superficially, so naturally my comment
> might miss the current core issues. But I'd like to
> recall that this was discussed quite intensively two
> years ago, and there were good reasons for the decisions.
> The current "limitation" (if it is one) was a design
> **decision**, not an oversight.

Yes, there was a decision, but I never did understand why it went the
way that it did. I think you are thinking of the 'Design issue'
thread on the nipy mailing list:

http://mail.scipy.org/pipermail/nipy-devel/2009-June/thread.html

My last email on that thread tried to ask why y'all had made that
decision because I (genuinely) didn't understand it:

http://mail.scipy.org/pipermail/nipy-devel/2009-June/001557.html

I think I've got the same questions then as now. If someone has time
to answer, then I'd be happy.

I suppose that there are some models where the design depends on the
data. I'm imagining that those models felt as if they needed 'exog'
before 'model.fit', and that there was a desire to make the __init__
signature similar for all models. I think the discussion was partly
about whether that made sense or not.

> Probably because I have not been following this discussion
> closely, it has looked rather like a solution in search
> of a problem. Can someone give a clear statement of the
> perceived problem and a list of possible solutions?

Right - I think you haven't been following. We (nipyers) had a
problem in search of a solution, that is, we have tens of thousands of
data vectors to fit with the same model. Typically, in fact, the data
Y is so large that we have to split it into chunks to fit into memory,
as in:

results = []
model = OLS(design_matrix)
for data_chunk in data_chunks:
# where data_chunk is say 800 x 4000
results.append(model.fit(data_chunk))

The obvious solution to the problem is that above - i.e. to allow Y
(exog?) to be a matrix rather than a vector, and to pass it into the
fit method. See the discussion pointed to above for the design
proposal. That was indeed what we had in the original
scipy.stats.models.

> E.g., if the issue is access to the transformed exogs,
> then that is easy enough.

Do you mean the whitened exogs?

>Or, would the issue be fully
> addressed if a matrix of endogs were fully handled?

Well, yes, and also to allow exog of None in instantiating the model.

For the sake of clarity, there are two issues here:

1) Should it be possible for exog to be a matrix? Josef pointed out
something reasonable, which was, if so, it would complicate the
'results' object. Maybe there's a way round that. I mean, there
could be a usual vector results object, with all the bells and
whistles, and another matrix results object, with none such. But
there's no question that it increases complexity.

2) Should it be possible for exog to be changed in a given model
object? Say with model.fit()? That's a different question. Josef's
point is that his usual case is changing the design (endog).
(Actually - is that easy in a given model object? Sorry for the silly
question). Our typical use case is the opposite. Our use case may
be rather specialized, I'm not a statistician....

See you,

Matthew

Matthew Brett

unread,

Mar 24, 2011, 4:36:45 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Hi,

On Thu, Mar 24, 2011 at 1:18 PM, Matthew Brett <matthe...@gmail.com> wrote:
> Hi,
>
> On Thu, Mar 24, 2011 at 4:48 AM, Alan G Isaac <ais...@american.edu> wrote:
>> On 3/23/2011 10:46 PM, Wes McKinney wrote:
>>>
>>> I see no particular reason why we can't support both APIs.
>>
>>
>> There is a conceptual and practical reason:
>> exogs without an endog do not constitute a model.
>> A model for example can produce results.
>
> No, I think that's wrong at least in the usual sense of - say - R.
> The model is the design, you fit the model to the data, and get
> results.

Actually, I take that back. It's a little complicated but:

res = lm(y~A + B +C, data=mydata)

Obviously R here fuses the *formula* (which clearly does include the
data), the *design matrix* (which is prepared using the formula and
the 'endog' data) and Y ('exog').

Now, I'm probably unreasonable in saying that it's obvious that the
right thing to call a 'model' is the design matrix - so I take that
back with an apology. I mean, that is what I'd call the model, but
it's really not obvious, at least to me.

See you,

Matthew

Alan G Isaac

unread,

Mar 24, 2011, 6:17:01 PM3/24/11

to pystat...@googlegroups.com

On 3/24/2011 4:18 PM, Matthew Brett wrote:
> We (nipyers) had a
> problem in search of a solution, that is, we have tens of thousands of
> data vectors to fit with the same model. Typically, in fact, the data
> Y is so large that we have to split it into chunks to fit into memory,

Can you please attach some concepts here,
possibly by describing such an application?

As an economist I hear you saying that you
have a small collection of explanatory variates (X=X1,..,Xn)
and a practically endless collection of things (Y=Y1,..,Ym) you
want X to explain, where the Xi and Yj have the same length (nobs).
Am I hearing you correctly? In the real world,
what are examples of X and Y?

Separately, the word 'model' can mean many different things.
Rather than trying to discover the essence of the term,
consider what is gained by packaging together the abstract
statistic description, the data, and the results as a single object.

Here is what is *lost*, if I understand you:
you want to undertake some specific transformation (whitening)
of X, and then you want to reuse *that unique transformation*
thousands of times. So you don't want to incur the cost
of whitening X in exactly the same way over and over again.
Am I understanding correctly?

Alan

josef...@gmail.com

unread,

Mar 24, 2011, 6:24:04 PM3/24/11

to pystat...@googlegroups.com

I think the main difference is between non-experimental data analysis
and experiments. In experiments you can define the design matrix in
advance and you can repeat the same experimental setup several times
reusing the design matrix.
In that case statistical model and design matrix together completely
describe the data.

Here I don't have any theoretical preference either way, just
practical consideration.

The image processing case is a bit different because you want to
"vectorize" the model estimation, estimating many models at the same
time, while the statistical model would be the big thing where you
have to analyze everything at once, i.e. the multivariate linear model
that Jonathan also discussed.

As you summarized my opinion, I don't think vectorizing (parallel
computation) models will be easy in the general setup, but it's a
common special case of the system of equation or multivariate linear
model.

To get away from the simple OLS case, the linear normal distributed model is

y = X*beta + u u~N(0, Sigma)

or for general distributions, generalized linear model, just specifies
the conditional distribution of the
f(y|x) = dist(x*beta, theta) or similar

the statistical model just specifies a relationship between one set of
data y and another set of data X. Either we have an abstract model and
both x and y are just terms in a formula, or we have a specific model
instance that include *the* data, both x and y.

In general if you also need to estimate Sigma or other parameters, you
cannot do much without having also the y available. just having X is
not enough to precalculate many things
(to be continued after taking care of kids)

Josef

>
> See you,
>
> Matthew
>

Matthew Brett

unread,

Mar 24, 2011, 7:05:02 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Hi,

On Thu, Mar 24, 2011 at 3:17 PM, Alan G Isaac <ais...@american.edu> wrote:
> On 3/24/2011 4:18 PM, Matthew Brett wrote:
>>
>> We (nipyers) had a
>> problem in search of a solution, that is, we have tens of thousands of
>> data vectors to fit with the same model. Typically, in fact, the data
>> Y is so large that we have to split it into chunks to fit into memory,
>
> Can you please attach some concepts here,
> possibly by describing such an application?

The first email in this thread covers the concepts, but I think you
mean an example.

> As an economist I hear you saying that you
> have a small collection of explanatory variates (X=X1,..,Xn)
> and a practically endless collection of things (Y=Y1,..,Ym) you
> want X to explain, where the Xi and Yj have the same length (nobs).
> Am I hearing you correctly? In the real world,
> what are examples of X and Y?

In functional MRI we collect a brain activity image each 1 second or
so. Some things are happening to the person while we scan them that
we think will change the brain activity. Let's say we express those
things as p regressors. Our images are - say - 128 by 128 by 32
pixels - usually called voxels (volume pixels). We have n time points
- say 500. So X (design matrix) is n by p. Y (data matrix) is n by
q where q = 128*128*32. People analyzing EEG have the same problem
with q around 128 or 256 but much bigger n.

> Separately, the word 'model' can mean many different things.
> Rather than trying to discover the essence of the term,
> consider what is gained by packaging together the abstract
> statistic description, the data, and the results as a single object.

And what is lost by conflating terms that are conceptually separable.
I should say, again, that I'm not a statistician, and I might be
misrepresenting their views.

> Here is what is *lost*, if I understand you:
> you want to undertake some specific transformation (whitening)
> of X, and then you want to reuse *that unique transformation*
> thousands of times. So you don't want to incur the cost
> of whitening X in exactly the same way over and over again.
> Am I understanding correctly?

Yes nearly. Optimal whitening of course depends on the data Y. In
our situation we simply can't afford to run the model in a loop voxel
by voxel so we either do no whitening (OLS) or we do some trick like
doing a first pass OLS, estimate AR co-efficients, break the
128*128*32 voxels into subsets with similar AR coefficients, and then
whiten and estimate for these subsets.

See you,

Matthew

Alan G Isaac

unread,

Mar 24, 2011, 8:45:27 PM3/24/11

to pystat...@googlegroups.com

On 3/24/2011 7:05 PM, Matthew Brett wrote:
> In functional MRI we collect a brain activity image each 1 second or
> so. Some things are happening to the person while we scan them that
> we think will change the brain activity. Let's say we express those
> things as p regressors. Our images are - say - 128 by 128 by 32
> pixels - usually called voxels (volume pixels). We have n time points
> - say 500. So X (design matrix) is n by p. Y (data matrix) is n by
> q where q = 128*128*32. People analyzing EEG have the same problem
> with q around 128 or 256 but much bigger n.

So ... this is pretty fascinating.

Iiuc, each pixel gets its own individual projection onto
the exogs (i.e., the design matrix). Do you want anything
besides the projection weights (i.e., regression coefficients)
for the whole group? E.g., do you want pvals for 2**19 sets
of coefficients?

Alan

Matthew Brett

unread,

Mar 24, 2011, 8:49:32 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Hi,

Right - the further analysis usually involves calculating t / F
statistics and p values for each voxel.

Cheers,

Matthew

Alan G Isaac

unread,

Mar 24, 2011, 9:43:13 PM3/24/11

to pystat...@googlegroups.com

On 3/24/2011 8:49 PM, Matthew Brett wrote:
> the further analysis usually involves calculating t / F
> statistics and p values for each voxel

Just to be clear (although it seems quite clear):
that's all 2**19 of them. Right?
You are not filtering them first?

Alan

Matthew Brett

unread,

Mar 24, 2011, 9:52:08 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Oh - sorry - yes - some of them won't be in the brain, you often
filter those - say half.

Matthew

Alan G Isaac

unread,

Mar 24, 2011, 10:09:25 PM3/24/11

to pystat...@googlegroups.com

On 3/24/2011 9:52 PM, Matthew Brett wrote:
> say half

OK, that's still 2**18 regressions
(not just projections but also computed statistics),
and (iiuc) all using OLS with a common design matrix
in the "first round". Is that correct?

Besides OLS (first stage) and AR(1) (second stage),
what other methods would you need?

Alan

Alan G Isaac

unread,

Mar 24, 2011, 10:48:27 PM3/24/11

to pystat...@googlegroups.com

This discussion has been fascinating, but it
has not led me to conclude there is a design
error in the Model class. The concept of a model
there is correct, but for a specific application
the class is not convenient for computational reasons.

The right response is not obvious.
Ignoring computational costs, just using the
existing class is a correct approach.
But it is obviously inefficient.

For OLS, it would be completely straightforward to implement
a "common regressors" class, which would take a Y *matrix*
and produce, I suppose, a tuple of standard results objects.

But in the application under discussion, it is even impractical
to have most of the Yi in one Y matrix. Again
this is an observation about computational limitations,
which bears only pragmatically (not conceptually)
on the implied notion of a model.

I'm going to guess that nevertheless this would meet the
OLS needs well enough. (?) However I can see a possible
opportunity here to separate out a projector class,
where an instance could be shared by several model instances.
This might (???) be interesting.

fwiw,
Alan

Matthew Brett

unread,

Mar 24, 2011, 11:55:30 PM3/24/11

to pystat...@googlegroups.com, Alan G Isaac

Hi,

On Thu, Mar 24, 2011 at 7:48 PM, Alan G Isaac <ais...@american.edu> wrote:
> This discussion has been fascinating, but it
> has not led me to conclude there is a design
> error in the Model class. The concept of a model
> there is correct, but for a specific application
> the class is not convenient for computational reasons.

What do you mean that the concept of the model is correct?

> For OLS, it would be completely straightforward to implement
> a "common regressors" class, which would take a Y *matrix*
> and produce, I suppose, a tuple of standard results objects.

Er - as in B = pinv(X*).dot(Y)?

> But in the application under discussion, it is even impractical
> to have most of the Yi in one Y matrix.

I don't think that's relevant.

> Again
> this is an observation about computational limitations,
> which bears only pragmatically (not conceptually)
> on the implied notion of a model.

> I'm going to guess that nevertheless this would meet the
> OLS needs well enough. (?) However I can see a possible
> opportunity here to separate out a projector class,
> where an instance could be shared by several model instances.
> This might (???) be interesting.

I'm not sure what you mean.

It may be that in practice we're just going to have to revert to what
we had before the design change, that is nipy scipy.stats.models.
Obviously that's a shame and would be a significant duplicated effort,
but it may be that our use case would clutter your implementation
beyond the benefit of the code sharing. It may also be that we
should have made our use case clearer in the early design discussions.

See you,

Matthew

Wes McKinney

unread,

Mar 25, 2011, 2:39:24 PM3/25/11

to pystat...@googlegroups.com

I guess we are getting philosophical here "what is a model?" but a
model (in my view) should represent a generative process for a data
realization, e.g. in the OLS case:

y ~ N(Xb, s^2 I)

given b and s^2, we can generate realizations y from this model. Of
course the estimation problem is: given y_observed estimate b and s^2.
So I don't think a data realization is "part" of the model in a strict
sense-- certainly part of the estimation problem :)

- Wes

Skipper Seabold

unread,

Mar 25, 2011, 3:11:47 PM3/25/11

to pystat...@googlegroups.com

Given b and s^2...? Spoken like a Bayesian ;) We don't have ideas
about these parameters without doing estimation or rather we aren't
doing prior predictive analysis with our models, so estimation and
hence the data is indeed tied up with the model as far as I'm
concerned. See:

http://statsmodels.sourceforge.net/trunk/developernotes.html#design

That's not to say that this isn't very useful and an area I'd like to
see the code base expanded to, but it wouldn't I don't think fit with
the sampling theory that the code is now based on.

I'm not quite sure I don't see why we can't have a special class
MultiOLS (MultiLM), that doesn't derive from model, but has a similar
API to the old stats.models. IIUC, these "models" are all quite
independent (different subjects, same experiment). It sounds like all
you need are f_test and t_test. This doesn't sound like there'd be too
much code-replication (there is certainly a way to share anyhow) or
require much maintenance/enhancements.

Relatedely, AFAICT, there isn't much difference between our API (which
is, again, certainly tied up in the conception of a model) and R's
minus the ease of the formula framework. Let me state this as a
question I guess, and I would be happy to be told that I'm wrong.
Matthew, Jonathan, how would you fit these models in R (or any other
statistical package)? Do you have an example script? Are you selecting
sub-samples of the dataframe with the same "model"? I don't know any
non-bayesian statistical software that isn't quite data-centric, but
I'd be interested to see how you would do it under some other package.

Skipper

Alan G Isaac

unread,

Mar 25, 2011, 3:12:33 PM3/25/11

to pystat...@googlegroups.com

On 3/25/2011 2:39 PM, Wes McKinney wrote:
> I guess we are getting philosophical here "what is a model?"

A statistical model is a mapping from data to parameter estimates
and related statistics.

That mapping is provided by a Model subclass.
(This is not quite right, because no explicit
parameterization can be implied by these classes
until the data is supplied. But it is close
enough to being right to be useful.)

An instance of such a class owns actual data
and produces results, such as parameter estimates
and related statistics.

Alan

josef...@gmail.com

unread,

Mar 25, 2011, 3:24:25 PM3/25/11

to pystat...@googlegroups.com

On Fri, Mar 25, 2011 at 3:12 PM, Alan G Isaac <ais...@american.edu> wrote:
> On 3/25/2011 2:39 PM, Wes McKinney wrote:
>>
>> I guess we are getting philosophical here "what is a model?"
>
>
> A statistical model is a mapping from data to parameter estimates
> and related statistics.

That's an estimator,

a statistical model is just the description of the sampling process
for some observed sample.
(speaking as frequentist). It doesn't specify how you estimate it.

*our* models are estimation classes

Josef
to join the philosophical fun

Alan G Isaac

unread,

Mar 25, 2011, 3:40:30 PM3/25/11

to pystat...@googlegroups.com

On 3/25/2011 3:24 PM, josef...@gmail.com wrote:
> a statistical model is just the description of the sampling process
> for some observed sample.
> (speaking as frequentist). It doesn't specify how you estimate it.

I think that is non-standard usage, or else I don't understand you.
See e.g. McCullagh (2002, Annals of Statistics).

Now McCullagh sounds different that what I said,
because he map parameters to distributions on the sample space,
but from a likelihood perspective that's two sides of the same coin.

That said, I agree with you:
the Model based classes are estimation classes.
I would just add that there is a justification for calling it Model.

fwiw,
Alan

Matthew Brett

unread,

Mar 25, 2011, 4:12:16 PM3/25/11

to pystat...@googlegroups.com

Hi,

Returning the question that I asked a long time ago on the original thread.

Why does the endog have to go with the exog? I can see this statement
"Our working definition of a statistical model is an object that has
both endogenous and exogenous data defined as well as a statistical
relationship" - but I can't see why. It's important to us because it
effectively means we can't use all your hard and valuable work.

In the original design, we did this sort of thing:

obj = OLSModel(X)
results = obj.fit(Y)

The model would be the design matrix and the estimation procedure.

It made sense to us (I mean Jonathan, who's a stats professor) that
there'd be something called a model, which would do the fit to the
data, and return results. Obviously, at a conceptual level, you can
fit the same - let's call it "model" - to different data. That is:

results1 = obj.fit(Y1)
results2 = obj.fit(Y2)

made perfect sense to us.

Now, I think I understand what Josef is saying. He's saying, I think,
that there's internal state in 'obj' that would need resetting for
each Y, and it's boring to keep track of that. I can understand that
statement. It makes me think that there must be some practical reason
why ``results2 = obj.fit(Y2)`` is not easy. But, if you have some
interest in me understanding that - maybe you don't - then could
someone point out in a little more detail what would in fact become
harder if we allowed data to be passed into fit and not passed into
the model constructor?

See you,

Matthew

josef...@gmail.com

unread,

Mar 25, 2011, 4:24:03 PM3/25/11

to pystat...@googlegroups.com

On Fri, Mar 25, 2011 at 3:40 PM, Alan G Isaac <ais...@american.edu> wrote:
> On 3/25/2011 3:24 PM, josef...@gmail.com wrote:
>>
>> a statistical model is just the description of the sampling process
>> for some observed sample.
>> (speaking as frequentist). It doesn't specify how you estimate it.
>
>
> I think that is non-standard usage, or else I don't understand you.
> See e.g. McCullagh (2002, Annals of Statistics).
>
> Now McCullagh sounds different that what I said,
> because he map parameters to distributions on the sample space,
> but from a likelihood perspective that's two sides of the same coin.

I don't know the reference, but I'm not just looking at a likelihood
perspective.

What I meant was that the statistical model just specifies the data
generating process,

y ~ Arma(p,q) with normal distributed noise
or
y = X* b + u
X = Z * gamma + v
E(Z*u) = 0, E(Z*v) = 0... whatever else we need for an instrumental
variable model

the estimator is a mapping (y,X,Z) -> beta_hat, gamma_hat, ...

The same statistical model could be estimated by least squares, GMM,
LIML, FIML, empirical likelihood, distance estimators. The statistical
model (assumptions on DGP) would be the same (although the different
estimator use different assumptions for consistency, ....), but the
mapping data to parameter estimates is different.

Our models are estimators for specific models, sometimes we have
several model (estimation) classes for the same statistical model,
e.g. discrete - glm, IV2SLS - GMM, sometimes we have different
estimators within the same model class (ARMA - exact MLE, conditional
MLE, conditional least-squares.)

(And to make it a bit messier, except for maximum likelihood and
Bayesian, none of the other estimators needs a fully specified
statistical model, some semi-parametric estimators only need a few
moment restrictions, and we will have a non-parametric section.)

Josef
(since I couldn't resist)

josef...@gmail.com

unread,

Mar 25, 2011, 5:10:06 PM3/25/11

to pystat...@googlegroups.com

Outside of experiments that's a very unusual use case. Most of the
time there is no reason to distinguish when y and x are available. In
the common case of model selection, it would go the other way,
initialize with y and then try out many different X, so it could be
ob.fit(X) (we do this in several places internally.)

>
> Now, I think I understand what Josef is saying. He's saying, I think,
> that there's internal state in 'obj' that would need resetting for
> each Y, and it's boring to keep track of that. I can understand that
> statement. It makes me think that there must be some practical reason
> why ``results2 = obj.fit(Y2)`` is not easy. But, if you have some
> interest in me understanding that - maybe you don't - then could
> someone point out in a little more detail what would in fact become
> harder if we allowed data to be passed into fit and not passed into
> the model constructor?

One: is the simplicity of design across models, in most cases you
cannot do anything without the y. GLSAR requires y to estimate the AR
error process. So fit(y) would not allow any preparation of the data
before calling fit, GLSAR also has an iterative_fit method, not sure
now between fit ant iterfit.
Having both X and y in the __init__ treats the data symmetrically and
we can do the data preparation immediately, (including calling
supporting functions like removing rows with nans).

Two: only allowing data in the __init__ enforces "one instance per
estimation problem", since users have no official way to change the
underlying data. Since it's python there are inofficial ways. "one
instance per estimation problem" means we don't have to worry about
stale state.
Otherwise we need a way to create a clean model instance.

Three: the current fit method also treats the model symmetrically, all
actual calculations are done in fit, no pre-prepared extra
calculations of x are used.
This could be changed for OLS and maybe GLS, but I don't think it will
work for any other models, e.g. Maximum Likelihood estimators, glm,
discrete, I don't know the details of rlm well enough.

Where it looks feasible, OLS, WLS and GLS with predefined whitening
arrays (not GLSAR_iterfit)

__init__(X) : remove preprocessing of y, separate initialize in y part
and x part
fit(y) :
<wipe or create new instance)
add call to preprocessing of y (whitening in initialize)
check if pinv(x) and similar are available if yes, then use it
return regression result instance, <and add a clean model instance>

the problem is we don't have a way to create a clean model instance,
so either we give up on attaching the model instance to the result or
allow empty model instances

__init__() : do nothing
prepare_exog(X) : initialize exog, calculate pinv(x) or QR(X)
fit(y) :
<create new instance)
add call to preprocessing of y (whitening in initialize)
check if pinv(x) and similar are available if yes, then use it

attach X and whatever else is needed to new model instance and attach
it to results
return regression result instance

The latter is how I would probably write it, but prepare exog wouldn't
be doing much in many models.
The option of not attaching the model instance to the result class
works well, if you don't care about properties/results that need
access to the model class and the data, f_test, t_test, .... would be
fine.

Josef

>
> See you,
>
> Matthew
>

Alan G Isaac

unread,

Mar 25, 2011, 5:55:35 PM3/25/11

to pystat...@googlegroups.com

Nice explanation, Josef.

Matthew said he will need AR(1) as well as OLS.
(I asked what else but he did not say yet.)
So perhaps it makes a good test case for thinking
about what might meet his needs. (Btw, Matthew,
did you mean that the errors are AR(1),
or that the model includes a lagged endogenous variable?
I've been assuming the latter. Anyway, the relevant issues
are the same.)

In particular, it is a case where it is clear
that you cannot correctly reuse the transformed
X from the model with say Y1 when you change to Y2.
In this sense this case should answer Matthew's question
about why a Model should own both X and Y.

(Matthew, is that clarifying?)

Now I think (?) Matthew has said something like
"regardless, I want to use the same
AR parameter for a bunch of different Ys".
(Did I hear that right? Not sure.)

A focus on this case may highlight what might
work as an accommodation of Matthews needs.
It sounds like he needs a model that imposes
a parameter constraint (e.g., fixes rho).

Alan

josef...@gmail.com

unread,

Mar 25, 2011, 6:11:22 PM3/25/11

to pystat...@googlegroups.com

looking at https://github.com/nipy/nipy/blob/master/nipy/fixes/scipy/stats/models/regression.py

Not getting stale state works, because there is nothing y specific
attached to the model, In OLS y is just used temporarily in the fit
method and attached to the results instance, the state includes X
which is what you want.

ARModel iterative_fit (corresponding to our GLSAR) doesn't return
anything but attaches rho as a result of iterative fit. This self.rho
is then used in the next whiten. If you call fit on a new y, it will
use the rho estimated in the last call to iterative_fit. I don't know
if this is what you want, but it's not the behavior I would expect,
and looks dangerously like state spill-overs to me.
(Just reading the code, maybe I'm still misinterpreting this. If you
have an example you could try if I'm wrong.)

GLSAR is still easy because it has only one additional parameter, and
the similar problem will show up even more in other models.

As example for my allergy to sharing instances. They require great
care, and lot's of work.

Josef

>
> Josef
>
>
>>
>> See you,
>>
>> Matthew
>>
>

Matthew Brett

unread,

Mar 25, 2011, 6:16:36 PM3/25/11

to pystat...@googlegroups.com, josef...@gmail.com

Hi,

It is surely not good if statsmodels is not well adapted to analyzing
experiments? It is really true that most people coming to python and
statistics are not doing experiments?

Don't you always need X in order what to know what to do with Y? I'm
thinking of the AR case.

>> Now, I think I understand what Josef is saying. He's saying, I think,
>> that there's internal state in 'obj' that would need resetting for
>> each Y, and it's boring to keep track of that. I can understand that
>> statement. It makes me think that there must be some practical reason
>> why ``results2 = obj.fit(Y2)`` is not easy. But, if you have some
>> interest in me understanding that - maybe you don't - then could
>> someone point out in a little more detail what would in fact become
>> harder if we allowed data to be passed into fit and not passed into
>> the model constructor?
>
> One: is the simplicity of design across models, in most cases you
> cannot do anything without the y. GLSAR requires y to estimate the AR
> error process. So fit(y) would not allow any preparation of the data
> before calling fit, GLSAR also has an iterative_fit method, not sure
> now between fit ant iterfit.

Right, but surely that's completely taken care of by:

def fit(self, Y):
Yproc = self.prepare_data(Y)
etc

> Having both X and y in the __init__ treats the data symmetrically and
> we can do the data preparation immediately, (including calling
> supporting functions like removing rows with nans).

But the 'symmetrically' sounds a bit abstract here. Surely the data
preparation could as easily be done in the fit method?

> Two: only allowing data in the __init__ enforces "one instance per
> estimation problem", since users have no official way to change the
> underlying data. Since it's python there are inofficial ways. "one
> instance per estimation problem" means we don't have to worry about
> stale state.
> Otherwise we need a way to create a clean model instance.

You mean, enforcing data in the __init__ I think. The proposal in
the previous thread on the nipy mailing list was - I think - similar
to what you were proposing in another email, that is, to attach the
estimator (that does depend on the data) to the results. Now, the
estimator could be another object, or it could be a modified instance
of the model. I can think of ways of handling that anyway.

> Three: the current fit method also treats the model symmetrically, all
> actual calculations are done in fit, no pre-prepared extra
> calculations of x are used.
> This could be changed for OLS and maybe GLS, but I don't think it will
> work for any other models, e.g. Maximum Likelihood estimators, glm,
> discrete, I don't know the details of rlm well enough.

Right - yes - so here I don't know why that is the case - I'm sorry -
pure ignorance on my part - is it easy to explain?

Could it not be so that some models do require the data to __init__
and some don't? I mean:

class Model(object):
def __init__(self, endog, exog=None):

def fit(self, exog=None):

as in fact you have it for OLS now, and as I think Wes was suggesting.

> Where it looks feasible, OLS, WLS and GLS with predefined whitening
> arrays (not GLSAR_iterfit)
>
> __init__(X) : remove preprocessing of y, separate initialize in y part
> and x part
> fit(y) :
> <wipe or create new instance)
> add call to preprocessing of y (whitening in initialize)
> check if pinv(x) and similar are available if yes, then use it
> return regression result instance, <and add a clean model instance>
>
> the problem is we don't have a way to create a clean model instance,
> so either we give up on attaching the model instance to the result or
> allow empty model instances

Well, could you not do:

def fit(self, endog):
new_model = self.copy()
new_model._set_endog(endog)
results = new_model._do_fit()
return results

? In this case we are conflating what has been called the estimator
with the 'model'. I could imagine separating them too.

Something like the above was the suggestion in the earlier thread I think.

Thanks - I think I'm getting a better grip on the problem now.

Matthew

josef...@gmail.com

unread,

Mar 25, 2011, 8:11:31 PM3/25/11

to pystat...@googlegroups.com

experiments that are identically repeated, have no experiment specific
explanatory variables, and where one additional pinv in one model type
matters.

I assume most users are not in this group.

>
> Don't you always need X in order what to know what to do with Y? I'm
> thinking of the AR case.

univariate time series analysis has only a y, vector autoregressive
has only a 2d y,
decomposition methods (PCA, ...) in scikits.learn only have an X

estimating a distribution from univariate data also only has a y, but
often needs X

>
>>> Now, I think I understand what Josef is saying. He's saying, I think,
>>> that there's internal state in 'obj' that would need resetting for
>>> each Y, and it's boring to keep track of that. I can understand that
>>> statement. It makes me think that there must be some practical reason
>>> why ``results2 = obj.fit(Y2)`` is not easy. But, if you have some
>>> interest in me understanding that - maybe you don't - then could
>>> someone point out in a little more detail what would in fact become
>>> harder if we allowed data to be passed into fit and not passed into
>>> the model constructor?
>>
>> One: is the simplicity of design across models, in most cases you
>> cannot do anything without the y. GLSAR requires y to estimate the AR
>> error process. So fit(y) would not allow any preparation of the data
>> before calling fit, GLSAR also has an iterative_fit method, not sure
>> now between fit ant iterfit.
>
> Right, but surely that's completely taken care of by:
>
> def fit(self, Y):
> Yproc = self.prepare_data(Y)
> etc
>
>> Having both X and y in the __init__ treats the data symmetrically and
>> we can do the data preparation immediately, (including calling
>> supporting functions like removing rows with nans).
>
> But the 'symmetrically' sounds a bit abstract here. Surely the data
> preparation could as easily be done in the fit method?

example maximum likelihood estimation for discrete models
http://bazaar.launchpad.net/~scipystats/statsmodels/devel/view/head:/scikits/statsmodels/discrete/discrete_model.py

each model defines loglike, and first and second derivatives (score
and hessian) each has access to self.endog and self.exog, loglike is
used for optimize.fmin or similar.

GenericLikelihoodModel might define several additional methods, that
all only make sense if the data, X and y has been attached.

It's symmetric in the sense that loglike = pdf(y-X*beta) in the
simplest case (because u=y-Xb ~ N(0,Sigma) is the model assumption for
the likelihood), or pdf(y, mu(x*beta), scale(x)) in glm type models

>
>> Two: only allowing data in the __init__ enforces "one instance per
>> estimation problem", since users have no official way to change the
>> underlying data. Since it's python there are inofficial ways. "one
>> instance per estimation problem" means we don't have to worry about
>> stale state.
>> Otherwise we need a way to create a clean model instance.
>
> You mean, enforcing data in the __init__ I think. The proposal in
> the previous thread on the nipy mailing list was - I think - similar
> to what you were proposing in another email, that is, to attach the
> estimator (that does depend on the data) to the results. Now, the
> estimator could be another object, or it could be a modified instance
> of the model. I can think of ways of handling that anyway.

maybe I wasn't clear here:
"one instance of each class" i.e. one estimation (model) instance, one
instance of the regression class, and maybe one instance of the
process class.
What I want to avoid is that different data estimation problems,
different X and/or y, use one and the same instance of the model
class.
The alternative is complete control and housekeeping of any data
specific attributes, which as Alan said goes against the usefulness of
an OO design.

>
>> Three: the current fit method also treats the model symmetrically, all
>> actual calculations are done in fit, no pre-prepared extra
>> calculations of x are used.
>> This could be changed for OLS and maybe GLS, but I don't think it will
>> work for any other models, e.g. Maximum Likelihood estimators, glm,
>> discrete, I don't know the details of rlm well enough.
>
> Right - yes - so here I don't know why that is the case - I'm sorry -
> pure ignorance on my part - is it easy to explain?

see above, call to optimize.fmin or similar needs to do *all* the
calculations for each parameter candidate, so I don't think that there
is much to precalculate in general.
glm and some other estimators do iterative fit, which also requires
recalculating in each case, e.g. new prewhiten, new pinv, ...

I don't think that statsmodels is very efficient in the details, but I
also don't think we are far off.

Your case of fitting the same GLSAR with the same AR coefficients to
several y is really special because usually (?) the assumption is that
the estimate of error process parameters are y-specific.

>
> Could it not be so that some models do require the data to __init__
> and some don't? I mean:
>
> class Model(object):
> def __init__(self, endog, exog=None):
>
> def fit(self, exog=None):
>
> as in fact you have it for OLS now, and as I think Wes was suggesting.

OLS exog=None is inconsistent because the super-class GLS still
requires it and raises the exception.

We use exog=None for models that don't require an X, mainly univariate
time series analysis for now.
If they require an exog, then it has to be included in the
instantiation, there is currently no way to add data outside the model
__init__

>
>> Where it looks feasible, OLS, WLS and GLS with predefined whitening
>> arrays (not GLSAR_iterfit)
>>
>> __init__(X) : remove preprocessing of y, separate initialize in y part
>> and x part
>> fit(y) :
>> <wipe or create new instance)
>> add call to preprocessing of y (whitening in initialize)
>> check if pinv(x) and similar are available if yes, then use it
>> return regression result instance, <and add a clean model instance>
>>
>> the problem is we don't have a way to create a clean model instance,
>> so either we give up on attaching the model instance to the result or
>> allow empty model instances
>
> Well, could you not do:
>
> def fit(self, endog):
> new_model = self.copy()
> new_model._set_endog(endog)
> results = new_model._do_fit()
> return results
>
> ? In this case we are conflating what has been called the estimator
> with the 'model'. I could imagine separating them too.
>
> Something like the above was the suggestion in the earlier thread I think.

Yes that would be my preferred way, having a clean "clone" method.
It's still not easy, because we need the right self.copy, we don't
want to make any physical copies of the data, but for many things we
don't want to copy a mutable reference. This requires some
housekeeping before it works without copying or not copying the wrong
things.

My main problem to change this in the standard model structure, is,
that the current design has worked pretty well across all types of
models and I have written enough draft version of different ones in
the sandbox to have a rough idea by now. The linear regression model,
where a change might be useful, will be only a small fraction of the
models that will be available in statsmodels (although the most
heavily used).

Coming back to the original problem, I think the multivariate linear
model would be a lot more efficient for your estimation problem. If
you just want to do AR(1) instead of AR(p), then I'm pretty sure it
can also easily be done in a vectorized way.
I think it would be useful to have estimation classes (like linear
model) that are specifically written for fast execution but are maybe
more limited in the available results. (You can check the regression
results instance to see if you really need all the "extras".) And they
wouldn't have to follow the generic pattern.

Cheers,

Josef

josef...@gmail.com

unread,

Mar 25, 2011, 11:43:33 PM3/25/11

to pystat...@googlegroups.com

Coming back to this one more time, in your usecase you don't really
want to store the complete model and result with all the data
attached, assuming it get's too large to process it all at once. In
this case, you would need to throw away the data (set to None) at some
point.
In that case, we wouldn't have to worry about incorrect state (copying
from what I have written early in the thread)

mod = OLS(y[:,0], exog)

results = []
for idx in arange(y.shape[1]):
endog = y[:,idx]
mod.wendog = mod.whiten(endog)
res = mod.fit()
results.append((res.params, res.tvalues, res.f_test(C)))

mod.endog = None
mod.wendog = None
or return nothing

I think this is how I would write a parametric bootstrap that Jonathan
mentioned, and we do something a bit similar in internal functions.

Create or change models, get the results we want out of the results
instance and then drop it.
Controlling state and updating is confined within one function/method,
and nobody get's it's hand on a potentially incomplete or stale
instance.

What we would need to change in OLS and similar in the current
implementation is that fit stores and reuses it's intermediate
calculations for X, mainly pinv, ...

This looks reasonable for large loops, if, insteadm you want to look
at an entire result instance and keep it, then an additional pinv
might not make much difference.

Josef

Matthew Brett

unread,

Mar 25, 2011, 11:56:59 PM3/25/11

to pystat...@googlegroups.com

Hi,

True, but nevertheless common enough for a chapter in the book Jonathan cited.

>> Don't you always need X in order what to know what to do with Y? I'm
>> thinking of the AR case.
>
> univariate time series analysis has only a y, vector autoregressive
> has only a 2d y,
> decomposition methods (PCA, ...) in scikits.learn only have an X
>
> estimating a distribution from univariate data also only has a y, but
> often needs X

Right - but this tells me that passing X and Y for all models may not
be a good one-size-fits-all choice.

Thank you - the clouds of my lack of understanding are starting to
lift. So here what you are saying is that the helper methods for
``fit()`` want to know the data and they may also want some
mid-fitting intermediate results. The most sensible way to do that
seemed to be to put the data, and the intermediate fit results into
the object itself as attributes.

How about instead having a fit_state object which contains the stuff
necessary for the fit, including the data? As in:

def fit(self, exog):
class FitState(object): pass
fit_state = FitState()
fit_state.exog = exog
for i in range(self.niter):
self.mid_fit(self, fit_state)

def mid_fit(self, fit_state):
# do something

I remember OO design coming up in the previous thread, but on its own
the idea of OO doesn't tell you what goes with what. For example
here, are the objects a Model which is also an Estimator, or is this a
Model which makes an Estimator when doing fittin?. The data-specific
attributes in the above snippet go into the FitState. Is that the
right OO design? A judgement call I think.

>>> Three: the current fit method also treats the model symmetrically, all
>>> actual calculations are done in fit, no pre-prepared extra
>>> calculations of x are used.
>>> This could be changed for OLS and maybe GLS, but I don't think it will
>>> work for any other models, e.g. Maximum Likelihood estimators, glm,
>>> discrete, I don't know the details of rlm well enough.
>>
>> Right - yes - so here I don't know why that is the case - I'm sorry -
>> pure ignorance on my part - is it easy to explain?
>
> see above, call to optimize.fmin or similar needs to do *all* the
> calculations for each parameter candidate, so I don't think that there
> is much to precalculate in general.
> glm and some other estimators do iterative fit, which also requires
> recalculating in each case, e.g. new prewhiten, new pinv, ...
>
> I don't think that statsmodels is very efficient in the details, but I
> also don't think we are far off.
>
> Your case of fitting the same GLSAR with the same AR coefficients to
> several y is really special because usually (?) the assumption is that
> the estimate of error process parameters are y-specific.

Yes, that is true.

Right. As you know, I work on nipy and we've suffered terribly from
heated discussions in the abstract, so I completely defer to your
collective experience with the code. I'm afraid I'm hoping that me
trying to understand why you did what you did might be of some benefit
to y'all too.

> Coming back to the original problem, I think the multivariate linear
> model would be a lot more efficient for your estimation problem. If
> you just want to do AR(1) instead of AR(p), then I'm pretty sure it
> can also easily be done in a vectorized way.
> I think it would be useful to have estimation classes (like linear
> model) that are specifically written for fast execution but are maybe
> more limited in the available results. (You can check the regression
> results instance to see if you really need all the "extras".) And they
> wouldn't have to follow the generic pattern.

Thanks. For some reason my brain did not fully take that last
paragraph in, but I'll let it settle until tomorrow.

See you,

Matthew

josef...@gmail.com

unread,

Mar 26, 2011, 1:00:35 AM3/26/11

to pystat...@googlegroups.com

>>>> Outside of experiments that's a very unusual use case. Most of the
>>>> time there is no reason to distinguish when y and x are available. In
>>>> the common case of model selection, it would go the other way,
>>>> initialize with y and then try out many different X, so it could be
>>>> ob.fit(X) (we do this in several places internally.)
>>>
>>> It is surely not good if statsmodels is not well adapted to analyzing
>>> experiments? It is really true that most people coming to python and
>>> statistics are not doing experiments?
>>
>> experiments that are identically repeated, have no experiment specific
>> explanatory variables, and where one additional pinv in one model type
>> matters.
>>
>> I assume most users are not in this group.
>
> True, but nevertheless common enough for a chapter in the book Jonathan cited.

I don't have the book but from the literature I was able to look up:

The full statistical model is just one big model and a sequence of OLS
by chance/special case.

y, X with y.shape[1] = 2^18
and a error covariance matrix of 2^18 * 2^18
which is just too big and requires shortcuts.
The textbook version might even use a sparse kronecker product
representation with
shape (500*2^18) * (500*2^18) (now that's scary, even when sparse)

But batch processing 1000 or 10000 y at the same time, would be closer
to what I would consider as a "usual" (?) approach. Jonathan provided
the function for it for a limited case.
That's what I was trying to say below.

Josef
(To take one piece out before turning of the computer)

Jonathan Taylor

unread,

Mar 30, 2011, 2:54:15 AM3/30/11

to pystat...@googlegroups.com

On Fri, Mar 25, 2011 at 12:12 PM, Alan G Isaac <ais...@american.edu> wrote:

On 3/25/2011 2:39 PM, Wes McKinney wrote:

I guess we are getting philosophical here "what is a model?"

A statistical model is a mapping from data to parameter estimates
and related statistics.

I would say that you are describing a __parametric__ statistical model.
To me, a statistical model is a specification of a (joint or conditional)
distribution of a tuple (response, predictors).

You only get parameter estimates when you say this distribution is parametric,
and you describe a particular parametrization of the set of possible distributions.

There are also semi-parametric models like

Y = X \beta + \epsilon

where we don't assume \epsilon is N(0, \sigma^2 I).

You can use OLS for both of these situations, but in the semi-parametric setting, one might use a residual bootstrap to estimate / calibrate the distribution of the parameter estimates in this model.

There are also non-parametric statistical models, of course...

Jonathan

--
Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel: 650.723.9230
Fax: 650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo

Jonathan Taylor

unread,

Mar 30, 2011, 4:22:01 AM3/30/11

to pystat...@googlegroups.com

On Fri, Mar 25, 2011 at 12:40 PM, Alan G Isaac <ais...@american.edu> wrote:

On 3/25/2011 3:24 PM, josef...@gmail.com wrote:

a statistical model is just the description of the sampling process
for some observed sample.
(speaking as frequentist). It doesn't specify how you estimate it.

I think that is non-standard usage, or else I don't understand you.
See e.g. McCullagh (2002, Annals of Statistics).

Now McCullagh sounds different that what I said,
because he map parameters to distributions on the sample space,
but from a likelihood perspective that's two sides of the same coin.

Well, up to identifiability they are sort of the same... otherwise the mapping is not 1:1.
This mapping depends on a particular parametrization of the set of distributions
on the sample space. This parametrization is typically expressed in terms of a design matrix.

I should also re-emphasize Josef's point that this mapping you describe is really
statistical estimation and McCullagh says nothing about statistical estimation.
McCullagh focuses on the parameterization of a set of distributions.

Yes, it is true that statistical estimators do depend on the response or "data", Y
because the estimators are given by some function F of Y, i.e.
the estimators are a measurable function of Y. However, this
function or mapping F does not depend on Y, though its
evaluated values obviously does. I can't say that I have ever absorbed all of McCullagh's paper, but
whatever else you want to say about it, this mapping also does not depend on
Y.

More precisely, let's look at the OLS model in which the only thing random is the response (it is a model for the conditional distribution of Y given X). In this case, then,
the model, or set of possible distributions is described by the possible means and variances
of the response (c.f. top of p.1236 of McCullagh's paper). This parameter space is
the product of the column span of the design matrix and (0,+\infty). A design matrix provides a parametrization of this column space, but this parametrization
may be many to one.

Once we've described our model, we should think about the estimation procedure. While McCullagh does not give a precise definition of estimation, fundamentally, the estimator is a measurable function from the data to the parameter space. Therefore, the estimation function F in this case should return a tuple: the estimated
mean of Y (a point in the column space of the design matrix) and the estimated variance of the noise. The OLS class' fit method
does not actually return F(Y), though F[1] does provide a proper estimate of the variance.
The fit method essentially returns

beta, sigma = (dot(pinv(X), F[0]), F[1])

and we can recover F by (dot(X, beta), sigma).

To be fair, we could also consider
the parameter space to be R^p \times (0,\infty) where X.shape = (n,p) but
I prefer to think of the parameter space as a subspace of R^n \times (0,\infty). This is
a matter of taste, in some sense. In this latter case, OLS' fit method does really provide
a measurable function of Y into the parameter space (though its fit method
does not take Y as an argument...).

Any non-singular transformation of the
columns of the design matrix (multiplcation on the right by a non-singular matrix) provides a different parametrization, i.e.
the first element of the tuple fit returns would change.
However, my statistical model in which the parameter space is a subspace of R^n is the same, by McCullagh's definition, as long as the column space is unchanged. If the design matrix is not full rank, then there are possibly singular transformations that preserve the statistical model but the parametrization is completely different, even of different dimension.

So, I don't think it's as simple as two sides of the same coin... and I think McCullagh would agree with me here.

My semi-parametric bootstrap model I mentioned in an earlier email is one in which the set of distributions is described by the column space of a design matrix and a non-parametric set of distributions. If I only assume that my noise is IID and has second moment, then the "parameters" for this model is the product of the column space of the design matrix and the set of measures on R with finite second moment. While I can't describe
this model in a simple parametric form, I could use the bootstrap and the OLS class to estimate the distribution of the least squares estimator of \beta in my particular parametrization of the column space... My concern would be that I might have to compute pinv(X) many many times if I can only pass (X,Y) to the constructor of OLS.

What Matthew (and I) want to convey is really that the standard normal theory linear regression model is described completely by the design matrix because the design matrix determines the column space in the parameter space of the model. The mapping that McCullagh describes takes a tuple (mean vector, sigma) and returns a distribution

N(mean vector, \sigma^2 I)

This mapping is not a function of Y.

Estimation of the parameters of this statistical model is most closely approximated by the function F I described above, which also does not depend on Y.

Obviously, though, the estimators depend on Y as they are equal to (some simple functions of) F(Y).

In short:

* A statistical model (according to McCullagh) is a function, but this function does not depend on the response. The model is a mapping is a parametrization of a set of distributions, i.e. a function from a parameter space to the set of probability measures on some fixed measurable space M which is the range of Y.

* Statistical estimation is described by a function, F, though this function does not depend on the response. Formally, F is a function from this fixed measurable space M to the parameter space.

* Statistical estimators are the composition of the "Statistical estimation" with Y, i.e. F(Y). As Y is itself a function (because it is a random variable defined on some sample space \Omega), statistical estimators are therefore functions from the sample space \Omega to the parameter space.

* OLS' fit method returns the statistical estimators evaluated at the observed data Y(\omega) which is a point in M. Now, for us M=R^n so we can think of Y(\omega) is a numpy array of shape (n,). Therefore, OLS' fit method returns one point in the parameter space: F(Y(\omega)).

* If we follow McCullagh's definition of a model, then its constructor should not require Y.

-- Jonathan

Alan G Isaac

unread,

Mar 30, 2011, 9:39:30 AM3/30/11

to pystat...@googlegroups.com

Perhaps I diverted this discussion by mentioning McCullagh (2002),
but I thought we were getting bogged down in personal definitions.
It is available online, so I thought it might be a useful reference.
http://galton.uchicago.edu/~pmcc/publications.html
On p.1236 he characterizes a statistical model diagrammatically.
In terms of this diagram it seems to me that Jonathan is focusing
on the Design, whereas the Model involves the Sample space.

But even if I am right, so what?
Such abstractions may or may not be helpful.
We need classes that can do certain work.

As an example of work that needs to be done,
consider Y(t) = a Y(t-1) + noise.
I do not see how Jonathan and Matthew propose
that we understand this "model" (small "m").

As Josef stressed, the Model-based *classes*
in statsmodels merely represent an estimation procedure.
However *instances* own their data.
An estimation cannot proceed without this data,
so this seems quite proper.

Jonathan and Matthew are raising a question that I reinterpret
(so that I can try to understand it) as, why not have a
separate Design class that has a fit method that takes Y as an argument.
(That is, something like earlier incarnations of the Model class.)

Once you provide Y you can make a Model instance.
So the reason for doing what they want
must be computational, not conceptual.
(This seems obvious to me; am I missing something?)

Specifically, iiuc, they are trying to avoid repeatedly
manipulating the design matrix in the same way (e.g.,
repeatedly producing the projection matrix), because
they want to use the same design with a very large
number of response variables.

Finding a nice way to accommodate this is reasonable,
but as Josef (iirc) has pointed out, it seems likely that
only a small subset of estimation classes imply an
obvious way to accommodate that need.

Alan

Alan G Isaac

unread,

Apr 5, 2011, 10:20:10 PM4/5/11

to pystat...@googlegroups.com

On 3/24/2011 11:55 PM, Matthew Brett wrote:
> It may be that in practice we're just going to have to revert to what
> we had before the design change, that is nipy scipy.stats.models.

I don't see how this follows in any way.

I think the question is:
how can we leverage the statsmodels facilities
for a particular *computational* need. I have
argued that the issue is computational rather
than conceptual and have not heard a denial.
(Perhaps due to a certain deafness; if so, humor me.)

I'm still curious as to your (and Jonathan's) response
to my last email on this topic. And in particular
I remain curious as to how you two describe the simple case

Y(t) = f(Y(t-1),theta) + e(t)

What is the "model" here that does not require the "response"
in the constructor?

Thanks,
Alan

Alan G Isaac

unread,

Apr 5, 2011, 10:22:37 PM4/5/11

to pystat...@googlegroups.com

On 4/5/2011 10:20 PM, Alan G Isaac wrote:
> I think the question is:
> how can we leverage the statsmodels facilities
> for a particular *computational* need.

And btw, figuring out what this means before the GSoC
might have a payoff.

Alan

Skipper Seabold

unread,

Apr 6, 2011, 12:35:55 PM4/6/11

to pystat...@googlegroups.com

On Tue, Apr 5, 2011 at 10:20 PM, Alan G Isaac <ais...@american.edu> wrote:
> On 3/24/2011 11:55 PM, Matthew Brett wrote:
>>
>> It may be that in practice we're just going to have to revert to what
>> we had before the design change, that is nipy scipy.stats.models.
>
>
> I don't see how this follows in any way.
>
> I think the question is:
> how can we leverage the statsmodels facilities
> for a particular *computational* need. I have
> argued that the issue is computational rather
> than conceptual and have not heard a denial.
> (Perhaps due to a certain deafness; if so, humor me.)
>

So, I don't have time to go back through the whole thread and maybe
this is really naive at best and plain stupid at worst, but IIUC, you
just want to avoid the expensive recomputing of pinv(exog). Why can't
we just compute the pinv in the __init__ as Josef noted above I
believe and then have something *like* (we would also have to make the
whitened endog, exog cached_writable attributes that update on
changes)

from scikits.statsmodels.regression.linear_model import GLS
from copy import deepcopy, copy

class NewGLS(GLS):
def __init__(self, endog, exog, sigma=None):
super(NewGLS, self).__init__(endog, exog, sigma)

def initialize(self, **kwargs):
"""
Takes whatever GLS takes. Only overwrites what it needs to.

Could possibly even go up to Model class
"""
attrs = kwargs.keys()
# NewCLS = deepcopy(self)
# Or can we always get away with shallow copy?
NewCLS = copy(self)
for attr in attrs:
if getattr(self, attr) is not None:
setattr(NewCLS, attr, kwargs[attr])
return NewCLS

The copy is bound to be less expensive than the psuedoinverse of a
"big" exog. We might then want to have a separate check_conformability
method to make sure that it only gets sensible attributes (which init
would also rely on). So you can do something like

import scikits.statsmodels as sm
dta = sm.datasets.longley.load()

mod = NewGLS(dta.endog, dta.exog)

for endog in dta.exog.T:
newmod_results = mod.initialize(endog=endog).fit()
... do something

That way the results objects will never hold a reference to a stale
model, you can only compute what you want from the new model. The only
less than ideal part is the copy, but I don't see how to avoid it.

Jonathan Taylor

unread,

Apr 19, 2011, 2:55:20 AM4/19/11

to pystatsmodels

On Wed, Mar 30, 2011 at 6:39 AM, Alan G Isaac <ais...@american.edu> wrote:

Perhaps I diverted this discussion by mentioning McCullagh (2002),
but I thought we were getting bogged down in personal definitions.
It is available online, so I thought it might be a useful reference.
http://galton.uchicago.edu/~pmcc/publications.html
On p.1236 he characterizes a statistical model diagrammatically.
In terms of this diagram it seems to me that Jonathan is focusing
on the Design, whereas the Model involves the Sample space.

My model (or one of my representations of a model) is a subspace of the sample space: the column space of a
design matrix. In any case, neither of those two diagrams depend on the point in the sample space your
experiment produced, i.e. the ndarray Y. The models are defined as functions on things like the sample space (and other spaces) which can be evaluated at any particular Y. Evaluating the proper function at a particular Y gives you a realized value of
a statistical estimator, but the statistic, i.e. the random variable, is the function itself.

Perhaps the simplest example of a statistical model is:

Y_i \sim u + \epsilon_i, \epsilon_i \overset{IID}{\sim} N(0, 1)

The statistical model in Section 1 of McCullagh is the mapping P:R \rightarrow M(R) where M(R^n) is the set of probability distributions on R^n defined by
P(mu) = N(mu np.ones(n), np.identity(n)). I don't need to know the realized value of Y to determine my mapping P...

But even if I am right, so what?
Such abstractions may or may not be helpful.
We need classes that can do certain work.

Agreed. And I do agree with you that the need seems to be more computational rather than conceptual.

As an example of work that needs to be done,
consider Y(t) = a Y(t-1) + noise.
I do not see how Jonathan and Matthew propose
that we understand this "model" (small "m").

What is noise? If noise = noise(t) is N(0,\sigma^2) independent. Then, for a fixed value of Y(0), the distribution of
(Y(1), \dots, Y(n)) is described completely by the tuple (a, \sigma^2, Y(0)). The mapping that takes
this tuple and produces this jointly normal distribution is a statistical model.

Jonathan and Matthew are raising a question that I reinterpret
(so that I can try to understand it) as, why not have a
separate Design class that has a fit method that takes Y as an argument.
(That is, something like earlier incarnations of the Model class.)

Once you provide Y you can make a Model instance.
So the reason for doing what they want
must be computational, not conceptual.
(This seems obvious to me; am I missing something?)

Yes, it is computational -- we don't want to spend the overhead of creating 100K Model instances with the same X.

Specifically, iiuc, they are trying to avoid repeatedly
manipulating the design matrix in the same way (e.g.,
repeatedly producing the projection matrix), because
they want to use the same design with a very large
number of response variables.

Yes.

Finding a nice way to accommodate this is reasonable,
but as Josef (iirc) has pointed out, it seems likely that
only a small subset of estimation classes imply an
obvious way to accommodate that need.

Yes, only the linear estimation classes WLS, GLS and OLS should
reasonably try to do this.

Jonathan

Jonathan Taylor

unread,

Apr 19, 2011, 3:19:58 AM4/19/11

to pystatsmodels

I'm still curious as to your (and Jonathan's) response
to my last email on this topic. And in particular
I remain curious as to how you two describe the simple case

Y(t) = f(Y(t-1),theta) + e(t)

What is the "model" here that does not require the "response"
in the constructor?

In this model, I don't know what is considered known and unknown, but let's assume
that the e(t) are IID N(0,\sigma^2) and f is a known function and we observe n time points (Y(1), ..., Y(n)).
Then, the statistical model
is a function that takes a tuple (theta, sigma, Y(0)) and returns the joint distribution of (Y(1), ..., Y(n))
on R^n.

If this joint distribution has a density with respect to a known measure (which will probably be Lebesgue
if the errors where N(0,\sigma^2)) then we can describe the model by this density, i.e. such a statistical model
is a mapping that takes a tuple (theta, sigma, Y(0)) and returns a density on R^n

The ideal constructor, if we are to follow McCullagh's model, should therefore return a mapping that takes
a 3-tuple and returns a density. This is the statistical model.

In order to find the MLE, one maximizes this function over (Y(0),theta, sigma) for a given ndarray: [Y(1),..., Y(n)].
This is the point where the response Y is needed. This is a function of the statistical model and an ndarray.

In theory, then, we might consider the following to be the only function needed to fit any likelihood based model:

def MLE(model, Y):
return scipy.optimize.fmin_powell(lambda theta: model.density(theta)(Y))

and the model would never have to see Y. Of course, that's not going to work very well...

Pragmatically, though, I do agree that it is easier to manage the state of the model objects if Y is given to __init__. I think
I would be happy if the OLS, GLS and WLS classes could take Y of shape (n,q), say. Since we typically will
load in data in chunks, we can perhaps use Skipper's copy suggestion and use a new instance on every chunk.

-- Jonathan

Reply all

Reply to author

Forward