Coordinate VAR model implementation


Wes McKinney

Dec 6, 2010, 1:03:41 AM
to pystat...@googlegroups.com
Hey Skipper / Josef,

I'm starting to finally reach the point where I'm going to start
cannibalizing / refactoring some of the existing VAR code.

You can see what I have here-- not too much beyond basic stable
VAR(p) estimation, and today I coded up a bunch of impulse
response-related things. I'm planning to add lots of bells and whistles
like nice plots with error bars (for IRFs, etc.), but I just wanted to
make you aware for planning future developments:

http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/annotate/head:/scikits/statsmodels/sandbox/tsa/varwork.py

My basic plan from here

- Stable VAR(p) estimation with bells and whistles with lag selection,
standard tests for normality, whiteness of residuals, etc.
- VECM estimation, incorporate your guys' existing code related to
cointegrated time series (unit root tests, etc.)

I s'pose I'll create a subpackage in tsa eventually since this will
quickly balloon into several thousand lines of code.

Later: Optimized dynamic versions of the above (recursive
out-of-sample forecasting, etc.), SVAR/SVECM, Bayesian estimators,
panel versions, etc.

Cheers,
Wes

josef...@gmail.com

Dec 6, 2010, 10:53:09 AM
to pystat...@googlegroups.com
On Mon, Dec 6, 2010 at 1:03 AM, Wes McKinney <wesm...@gmail.com> wrote:
> Hey Skipper / Josef,
>
> I'm starting to finally reach the point where I'm going to start
> cannibalizing / refactoring some of the existing VAR code.
>
> You can see what I have here-- not too much beyond basic stable
> VAR(p) estimation, and today I coded up a bunch of impulse
> response-related things. I'm planning to add lots of bells and whistles
> like nice plots with error bars (for IRFs, etc.), but I just wanted to
> make you aware for planning future developments:
>
> http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/annotate/head:/scikits/statsmodels/sandbox/tsa/varwork.py

Thanks Wes, splitting up the code into many smaller methods makes it
quite easy to read.

I've been trying to figure out for an hour now what I should think of the
structure. I like the extra classes, for example for the impulse
response function, and the methods; adding plots makes it very quick to
get an overview. But I think the estimation is too tightly integrated
with the VARProcess, and some parts, like estimation in __init__
instead of a fit() method, are not consistent with the current pattern.
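
[For concreteness, the model/fit/results split being described looks roughly like this -- a minimal sketch with hypothetical class names, not the actual statsmodels API of the time:]

```python
import numpy as np

class VARModel(object):
    """Holds data and specification; no estimation happens in __init__."""
    def __init__(self, endog, lags=1):
        self.endog = np.asarray(endog)
        self.lags = lags

    def fit(self):
        """Estimation lives here, so specifying and fitting are separate steps."""
        y = self.endog
        Y = y[self.lags:]
        # regressors: intercept plus each lag of the endogenous variables
        Z = np.column_stack([np.ones(len(Y))] +
                            [y[self.lags - k:len(y) - k]
                             for k in range(1, self.lags + 1)])
        params = np.linalg.lstsq(Z, Y, rcond=None)[0]
        return VARResults(self, params)

class VARResults(object):
    """Collects the estimates and keeps a reference back to the model."""
    def __init__(self, model, params):
        self.model = model
        self.params = params
```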

As an example:

>>> s = VARSummary(est)
>>> s._stats_table()
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    s._stats_table()
  File "C:\Josef\eclipsegworkspace\statsmodels-wesm\scikits\statsmodels\sandbox\tsa\varwork.py", line 1444, in _stats_table
    part2Ldata = [[self.neqs],[self.nobs],[self.llf],[self.aic]]
AttributeError: 'VARSummary' object has no attribute 'neqs'

aic, bic, fpe, and tests on parameters should be in a generic Results
class, so that they can be inherited by different estimators in the same
group of models, instead of being written for each model separately. I
was a bit unhappy at the beginning of working on stats.models that
Results classes are so important, but by now I think the advantages for
code reuse through inheritance are great.
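
[A toy illustration of what that inheritance buys -- class names hypothetical; the formulas are the standard AIC/BIC definitions, not necessarily the df conventions statsmodels settled on:]

```python
import numpy as np

class GenericResults(object):
    """Criteria written once, in terms of llf/nobs/df_model,
    so every estimator that supplies those gets them for free."""
    @property
    def aic(self):
        return -2 * self.llf + 2 * self.df_model

    @property
    def bic(self):
        return -2 * self.llf + np.log(self.nobs) * self.df_model

class SomeTSAResults(GenericResults):
    # a concrete estimator only supplies the ingredients
    def __init__(self, llf, nobs, df_model):
        self.llf, self.nobs, self.df_model = llf, nobs, df_model
```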

What's not so clear to me yet is what the best class for working with
time series data is. It's currently all in your VAR, but because the
model and estimation results are tightly integrated, I think it will
be more difficult to extend it and inherit from it.
For many other models, the main point is the results class, with
estimated parameters and tests for them. With tsa there is a lot more
emphasis on forecasting, so the results class is not the main focus,
but it is still useful for collecting the statistical results from the
estimation and the corresponding tests.

>
> My basic plan from here
>
> - Stable VAR(p) estimation with bells and whistles with lag selection,
> standard tests for normality, whiteness of residuals, etc.
> - VECM estimation, incorporate your guys' existing code related to
> cointegrated time series (unit root tests, etc.)
>
> I s'pose I'll create a subpackage in tsa eventually since this will
> quickly balloon into several thousand lines of code.

I would prefer if you could add some (semi-)generic bells and whistles
to the corresponding modules, stattools, tsatools and graphics. This
would keep reusable code better organized.

I have more comments on some details later, but I'm out of time right now.

Can you protect the pandas tester with a try..except ImportError? My
pandas install seems to be too old.
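
[The guard being asked for is the usual optional-dependency pattern -- a sketch; the real test-skipping hook would depend on the test runner, nose at the time:]

```python
try:
    import pandas
    HAVE_PANDAS = True
except ImportError:
    HAVE_PANDAS = False

def test_pandas_integration():
    if not HAVE_PANDAS:
        return  # in a real suite: raise nose.SkipTest("pandas not available")
    # pandas-dependent checks would go here
    assert pandas.__version__ is not None
```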

Two more general questions as a background:
How much is MLE used for VAR? I don't think I have seen it much yet,
without having specifically looked for it.
What's the dimension for VAR, (nobs, nvars), that would be the use
case of your (former) company?

Cheers,

Josef

Wes McKinney

Dec 6, 2010, 10:56:26 PM
to pystat...@googlegroups.com
On Mon, Dec 6, 2010 at 10:53 AM, <josef...@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 1:03 AM, Wes McKinney <wesm...@gmail.com> wrote:
>> Hey Skipper / Josef,
>>
>> I'm starting to finally reach the point where I'm going to start
>> cannibalizing / refactoring some of the existing VAR code.
>>
>> You can see what I have here-- not too much beyond basic stable
>> VAR(p) estimation, and today I coded up a bunch of impulse
>> response-related things. I'm planning to add lots of bells and whistles
>> like nice plots with error bars (for IRFs, etc.), but I just wanted to
>> make you aware for planning future developments:
>>
>> http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/annotate/head:/scikits/statsmodels/sandbox/tsa/varwork.py
>
> Thanks Wes, splitting up the code into many smaller methods makes it
> quite easy to read.
>
> I've been trying to figure out for an hour now what I should think of the
> structure. I like the extra classes, for example for the impulse
> response function, and the methods; adding plots makes it very quick to
> get an overview. But I think the estimation is too tightly integrated
> with the VARProcess, and some parts, like estimation in __init__
> instead of a fit() method, are not consistent with the current pattern.

My focus at the moment is just getting things implemented in a simple
and readable manner-- small, easy-to-read functions (I usually think a
function over 20 or 30 lines needs to be refactored), with as much
reusability as possible. Class structure etc. is important, but I want
to get those things hammered down first.

> As an example:
>
>>>> s = VARSummary(est)
>>>> s._stats_table()
> Traceback (most recent call last):
>  File "<pyshell#32>", line 1, in <module>
>    s._stats_table()
>  File "C:\Josef\eclipsegworkspace\statsmodels-wesm\scikits\statsmodels\sandbox\tsa\varwork.py",
> line 1444, in _stats_table
>    part2Ldata = [[self.neqs],[self.nobs],[self.llf],[self.aic]]
> AttributeError: 'VARSummary' object has no attribute 'neqs'

Sorry-- I copy-pasted and refactored some of Skipper's code and didn't
fix that portion yet.

> aic, bic, fpe, and tests on parameters should be in a generic Results
> class, so that they can be inherited by different estimators in the same
> group of models, instead of being written for each model separately. I
> was a bit unhappy at the beginning of working on stats.models that
> Results classes are so important, but by now I think the advantages for
> code reuse through inheritance are great.
>
> What's not so clear to me yet is what the best class for working with
> time series data is. It's currently all in your VAR, but because the
> model and estimation results are tightly integrated, I think it will
> be more difficult to extend it and inherit from it.
> For many other models, the main point is the results class, with
> estimated parameters and tests for them. With tsa there is a lot more
> emphasis on forecasting, so the results class is not the main focus,
> but it is still useful for collecting the statistical results from the
> estimation and the corresponding tests.

Yeah, not clear to me that the results class is the best structure for
tsa. As I go about implementing things I hope to get a better idea of
what would be a better structure.

>>
>> My basic plan from here
>>
>> - Stable VAR(p) estimation with bells and whistles with lag selection,
>> standard tests for normality, whiteness of residuals, etc.
>> - VECM estimation, incorporate your guys' existing code related to
>> cointegrated time series (unit root tests, etc.)
>>
>> I s'pose I'll create a subpackage in tsa eventually since this will
>> quickly balloon into several thousand lines of code.
>
> I would prefer if you could add some (semi-)generic bells and whistles
> to the corresponding modules, stattools, tsatools and graphics. This
> would keep reusable code better organized.

I plan to move out utilities and reusable functions (e.g. plotting,
vec/vech-related functions, etc.).
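
[For reference, the standard definitions of those two utilities -- vec stacks the columns, vech stacks the on-and-below-diagonal elements column by column. A sketch, not the varwork.py implementations:]

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector (column-major ravel)."""
    return np.asarray(A).ravel(order='F')

def vech(A):
    """Stack the elements on and below the diagonal, column by column."""
    A = np.asarray(A)
    return np.concatenate([A[j:, j] for j in range(A.shape[1])])
```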

> I have more comments on some details later, but I'm out of time right now.
>
> Can you protect the pandas tester with a try..except ImportError? My
> pandas install seems to be too old.

OK-- didn't intend for public consumption but will be a bit sensitive
;) I have a function to read in the Lutkepohl data files but it uses
pandas...

> Two more general questions as a background:
> How much is MLE used for VAR? I don't think I have seen it much yet,
> without having specifically looked for it.
> What's the dimension for VAR, (nobs, nvars), that would be the use
> case of your (former) company?

I've skirted around using MLE for the moment-- I don't have a good
idea how important it is. I'd like the model classes to be equipped to
handle in the 10s of variables with 1000s of observations. Of course
in higher dimensional problems some different tools might be needed. I
will likely be working on some (Bayesian) VAR-type models for high
dimensional data in the relatively near future, some of which could be
incorporated into statsmodels...

Thanks for looking, I will have more time over the holidays to flesh things out.

josef...@gmail.com

Dec 7, 2010, 12:28:31 AM
to pystat...@googlegroups.com

It's no problem to have unfinished code in the sandbox; my point was
that often we should get this for free through inheritance (in a
results class) instead of copy-paste.


>
>> aic, bic, fpe, and tests on parameters should be in a generic Results
>> class, so that they can be inherited by different estimators in the same
>> group of models, instead of being written for each model separately. I
>> was a bit unhappy at the beginning of working on stats.models that
>> Results classes are so important, but by now I think the advantages for
>> code reuse through inheritance are great.
>>
>> What's not so clear to me yet is what the best class for working with
>> time series data is. It's currently all in your VAR, but because the
>> model and estimation results are tightly integrated, I think it will
>> be more difficult to extend it and inherit from it.
>> For many other models, the main point is the results class, with
>> estimated parameters and tests for them. With tsa there is a lot more
>> emphasis on forecasting, so the results class is not the main focus,
>> but it is still useful for collecting the statistical results from the
>> estimation and the corresponding tests.
>
> Yeah, not clear to me that the results class is the best structure for
> tsa. As I go about implementing things I hope to get a better idea of
> what would be a better structure.

I'm looking forward to hearing your opinion on this.

>
>>>
>>> My basic plan from here
>>>
>>> - Stable VAR(p) estimation with bells and whistles with lag selection,
>>> standard tests for normality, whiteness of residuals, etc.
>>> - VECM estimation, incorporate your guys' existing code related to
>>> cointegrated time series (unit root tests, etc.)
>>>
>>> I s'pose I'll create a subpackage in tsa eventually since this will
>>> quickly balloon into several thousand lines of code.
>>
>> I would prefer if you could add some (semi-)generic bells and whistles
>> to the corresponding modules, stattools, tsatools and graphics. This
>> would keep reusable code better organized.
>
> I plan to move out utilities and reusable functions (e.g. plotting,
> vec/vech-related functions, etc.)
>
>> I have more comments on some details later, but I'm out of time right now.
>>
>> Can you protect the pandas tester with a try..except ImportError? My
>> pandas install seems to be too old.
>
> OK-- didn't intend for public consumption but will be a bit sensitive
> ;) I have a function to read in the Lutkepohl data files but it uses
> pandas...

I didn't have problems with that, and I think it's only for the examples.
It's not yet for public consumption, but I like to "play" with it to see how it works.

>
>> Two more general questions as a background:
>> How much is MLE used for VAR? I don't think I have seen it much yet,
>> without having specifically looked for it.
>> What's the dimension for VAR, (nobs, nvars), that would be the use
>> case of your (former) company?
>
> I've skirted around using MLE for the moment-- I don't have a good
> idea how important it is. I'd like the model classes to be equipped to
> handle in the 10s of variables with 1000s of observations. Of course
> in higher dimensional problems some different tools might be needed. I
> will likely be working on some (Bayesian) VAR-type models for high
> dimensional data in the relatively near future, some of which could be
> incorporated into statsmodels...

Dynamic factor models or factor-augmented VARs might then also be
interesting. I can't find the right reference right now (one by Bai and
Ng), but this one got published recently, shows the main idea, and has
the main players in the references:
http://ideas.repec.org/p/cpr/ceprdp/7098.html

Cheers,

Josef

Skipper Seabold

Dec 14, 2010, 5:35:35 PM
to pystat...@googlegroups.com
Having a look now, as I find some time with the semester ending, and I
am returning to VAR as soon as I clean up the ARMA tests (which
shouldn't need much more work).

I also am not completely sure about the inheritance structure of VAR
itself. As I recall, I ended up repeating a lot of what I did for
SUR, and I also recall that many statistical packages' VAR
implementations depend on SUR estimation
(http://www.stata.com/help.cgi?var). In the end, it's probably going
to come down to what gets done first (there is a working but probably
not optimal implementation of SUR and simultaneous equations stuff in
the sandbox/sysreg.py), but DRY might suggest leveraging SUR for VAR
and possibly for panel data as well. I don't know yet, but I will think
about it when I get back into the details.

>>>
>>> My basic plan from here
>>>
>>> - Stable VAR(p) estimation with bells and whistles with lag selection,
>>> standard tests for normality, whiteness of residuals, etc.
>>> - VECM estimation, incorporate your guys' existing code related to
>>> cointegrated time series (unit root tests, etc.)
>>>

If you start doing cointegration tests (and possibly others-- I can't
recall, but the test statistics were pretty general), I have the
up-to-date tables for critical values etc. from

MacKinnon, J.G. 1994. "Approximate Asymptotic Distribution Functions for
Unit-Root and Cointegration Tests." Journal of Business & Economic
Statistics, 12.2, 167-76.
MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests."
Queen's University, Dept. of Economics Working Papers 1227.
http://ideas.repec.org/p/qed/wpaper/1227.html

They are in tsa/adfvalues.py. Let me know if it doesn't make sense.

>>> I s'pose I'll create a subpackage in tsa eventually since this will
>>> quickly balloon into several thousand lines of code.
>>
>> I would prefer if you could add some (semi-)generic bells and whistles
>> to the corresponding modules, stattools, tsatools and graphics. This
>> would keep reusable code better organized.
>
> I plan to move out utilities and reusable functions (e.g. plotting,
> vec/vech-related functions, etc.)
>
>> I have more comments on some details later, but I'm out of time right now.
>>
>> Can you protect the pandas tester with a try..except ImportError? My
>> pandas install seems to be too old.
>
> OK-- didn't intend for public consumption but will be a bit sensitive
> ;) I have a function to read in the Lutkepohl data files but it uses
> pandas...

One quick note: varwork.py is not import-safe. I try to put all the
script stuff in an if __name__ == "__main__" block. I often work in the
source directory when it's pure Python. Right now I tried

from varwork import vech

to compare to my friend's implementation of the same. And I get

/home/skipper/statsmodels/statsmodels-wesm/scikits/statsmodels/sandbox/tsa/varwork.py
in <module>()
1534
1535 path = 'scikits/statsmodels/sandbox/tsa/data/%s.dat'
-> 1536 sdata, dates = parse_data(path % 'e1')
1537
1538 names = sdata.dtype.names

/home/skipper/statsmodels/statsmodels-wesm/scikits/statsmodels/sandbox/tsa/varwork.py
in parse_data(path)
51
52 regex = re.compile('<(.*) (\w)([\d]+)>.*')
---> 53 lines = deque(open(path))
54
55 to_skip = 0

IOError: [Errno 2] No such file or directory:
'scikits/statsmodels/sandbox/tsa/data/e1.dat'

I can live with the pandas dependency, but it'd be nice if it were in a
try/except block. I always forget to do this for external packages,
but end up having to go back and do it when I change machines.
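
[Both issues have stock fixes: an if __name__ == "__main__" guard for the script portion, and resolving data files against the module's location instead of the working directory. A sketch -- the data_path helper is a hypothetical name:]

```python
import os

# Resolve data files against the module's own location rather than the
# current working directory, so importing from any directory works.
_DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data')

def data_path(name):
    # name is e.g. 'e1' for one of the Lutkepohl dataset files
    return os.path.join(_DATA_DIR, '%s.dat' % name)

if __name__ == '__main__':
    # scratch/example code lives under this guard, so a plain
    # "from varwork import vech" triggers no side effects
    print(data_path('e1'))
```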

>
>> Two more general questions as a background:
>> How much is MLE used for VAR? I don't think I have seen it much yet,
>> without having specifically looked for it.
>> What's the dimension for VAR, (nobs, nvars), that would be the use
>> case of your (former) company?
>
> I've skirted around using MLE for the moment-- I don't have a good
> idea how important it is. I'd like the model classes to be equipped to
> handle in the 10s of variables with 1000s of observations. Of course
> in higher dimensional problems some different tools might be needed. I
> will likely be working on some (Bayesian) VAR-type models for high
> dimensional data in the relatively near future, some of which could be
> incorporated into statsmodels...
>

From what I've seen, not many people worry about MLE with VARs, and if
they do, it's nominally conditional MLE (i.e., OLS). The KF could be
employed if needed. (Don't worry, I don't have any short-run plans to
get lost down this rabbit hole, but I might over the course of
dissertation work.)
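
[That equivalence is concrete: the conditional Gaussian MLE of a VAR is just equation-by-equation OLS on the lagged values. A sketch for a VAR(1) with intercept, not the statsmodels implementation:]

```python
import numpy as np

def var1_ols(y):
    """Conditional MLE of a VAR(1): multivariate OLS of y_t on (1, y_{t-1})."""
    y = np.asarray(y)
    Y = y[1:]                                        # left-hand side, t = 1..T-1
    Z = np.column_stack([np.ones(len(Y)), y[:-1]])   # intercept + one lag
    B = np.linalg.lstsq(Z, Y, rcond=None)[0]         # (1 + K, K) coefficients
    resid = Y - Z.dot(B)
    sigma_mle = resid.T.dot(resid) / len(Y)          # MLE (no df correction)
    return B, sigma_mle
```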

> Thanks for looking, I will have more time over the holidays to flesh things out.
>

Keep me posted. I've started on the Bayesian VAR with dummy-observation
priors, but I'd like to have the basic VAR/SVAR down before I start
adding this to the code base, so I will be looking over your code as
well.

I also have some not-yet-cleaned-up structural VAR code that works for
recursive identification and long-run restrictions (Blanchard-Quah)
that I will try to push (or send you when you start on this).

Skipper

Wes McKinney

Jan 30, 2011, 10:29:38 AM
to pystat...@googlegroups.com

Returning to this now-- I'd like to merge my branch into 0.3-devel
before things get too out of hand. I've organized all the VAR code
into a module under tsa:

http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/files/2034?file_id=var-20110121195501-8c6bdua3kvjkv1g4-1

I also moved the univariate AR model code out of tsa/var.py and into
tsa/ar.py-- can be moved elsewhere to your taste :) What's left of the
original var.py is in var/alt.py.

My plan at the moment is to get all this code working, tested, and
ready for the 0.3 release. Some docs might be nice, too. A little
refactoring to the module structure (and changes to variable names for
consistency with the rest of the package) is still in order but should
be fairly easy.

Not sure what's the easiest way-- are you guys comfortable giving me
push privileges on devel?

- Wes

Wes McKinney

Jan 30, 2011, 1:15:11 PM
to pystat...@googlegroups.com

As an aside, I've had good luck recently using statsmodels in
"develop" mode (via setuptools):

python setup.py develop

popular among matplotlib developers also...

Skipper Seabold

Jan 30, 2011, 1:25:52 PM
to pystat...@googlegroups.com
On Sun, Jan 30, 2011 at 1:15 PM, Wes McKinney <wesm...@gmail.com> wrote:
> As an aside, I've had good luck recently using statsmodels in
> "develop" mode (via setuptools):
>
> python setup.py develop
>
> popular among matplotlib developers also...
>

Hmm, something is different between your branch and devel, but I don't
see what from just looking quickly. I can't import yours after a full
install, but I can if I use develop. A few more comments in a bit.


In [1]: import scikits.statsmodels.api as sm
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
/home/skipper/<ipython-input-1-4772e21c91dd> in <module>()
----> 1 import scikits.statsmodels.api as sm

/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0dev-py2.6.egg/scikits/statsmodels/api.py
in <module>()
3 import regression
4 from regression.linear_model import OLS, GLS, WLS, GLSAR
----> 5 from scikits.statsmodels.glm.glm import GLM
6 import scikits.statsmodels.glm.families as families
7 import robust

ImportError: No module named statsmodels.glm.glm

Skipper

Skipper Seabold

Jan 30, 2011, 1:42:49 PM
to pystat...@googlegroups.com

The api.py imports need to be relative.

Skipper

Skipper Seabold

Jan 30, 2011, 2:52:45 PM
to pystat...@googlegroups.com
On Sun, Jan 30, 2011 at 10:29 AM, Wes McKinney <wesm...@gmail.com> wrote:
> Returning to this now-- I'd like to merge my branch into 0.3-devel
> before things get too out of hand. I've organized all the VAR code
> into a module under tsa:
>
> http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/files/2034?file_id=var-20110121195501-8c6bdua3kvjkv1g4-1
>

A few scattered comments.

The decorators are in the tools directory now but aren't imported as
such in var/model.py.

Should we go ahead and make matplotlib a hard or soft dependency?
Same with pandas. This may have already been discussed. Either is
fine with me.

Should we try to follow the var/model.py pattern elsewhere in the
code? It might be nice for organization if it's used consistently.

What's e1.dat? Can you do the examples with the macrodata dataset or
include e1?

Do you think you can pull some of the tests that are methods out of
VAREstimator and have them available as functions? If you want to
attach them to the model, we talked about using mix-in classes, but I
don't think we've implemented it in too many places yet. If we do
start the mix-in patterns, should we separate the tests from the
results so that it's something like
ModelResults.tests.test_causality() or something. Also, I see that
you do the actual fitting in VAREstimator. For the most part, we have
been doing the fitting in the model's fit method and then returning a
ModelResults class that holds the results. Do we want to stay
consistent?

I also keep a reference to the original model in a 'model' attribute
in the results class.

> I also moved the univariate AR model code out of tsa/var.py and into
> tsa/ar.py-- can be moved elsewhere to your taste :) What's left of the

That's fine. I also think that yule_walker should be moved there too.

> original var.py is in var/alt.py.
>

Did you reuse any of the code from VAR2 or VARMAResults? You could
get the loglikelihood from there. Once the results are settled across
classes, we can delete alt.py.

> My plan at the moment is to get all this code working, tested, and
> ready for the 0.3 release. Some docs might be nice, too. A little
> refactoring to the module structure (and changes to variable names for
> consistency with the rest of the package) is still in order but should
> be fairly easy.

Can you make notes about what conventions you use for the information
criteria? Did you follow Lutkepohl mainly? It looks like for the
information criteria you want

from scikits.statsmodels.tools.compatibility import np_slogdet
ld = np_slogdet(sigma_mle)[1]

Also, these look to be slightly different from the ones I have (not
sure that mine are correct, haven't revisited in a while), and none of
ours match up with Stata, gretl, or R. For the XX dataset I have in
the __main__ of alt.py

Stata
w/ lutstats
AIC = -27.95936
HQIC = -27.83923
SBIC = -27.66251
FPE = 7.42e-13

w/out lutstats
AIC = -19.41573
HQIC = -19.27557
SBIC = -19.0694
FPE = 7.42e-13


Gretl
AIC = -19.4157
BIC = -19.0694
HQC = -19.2756

R
AIC(n) -2.792934e+01
HQ(n) -2.778919e+01
SC(n) -2.758302e+01
FPE(n) 7.421288e-13

Yours
aic -27.719339439671284
hqic -27.439035936971873
bic -27.026692792696199
fpe -28.139339439671286


Mine
aic -28.139339439671282
hqic -27.789187688321576
bic -27.583016116183739
fpe 6.0150626755719212e-13

Usually with any of these, I find that the differences are in the
assumptions about the degrees of freedom or are propagated from small
differences in the likelihood, i.e., counting the constant/trend, etc.
For the most part, I have been assuming that if you estimate something,
you use a degree of freedom: the "complexity" of the model increases
and should be penalized. In any case, we just need to pick a
convention and document it (even if only in a source code comment,
which is what I usually end up doing).
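
[For the record, a sketch of Lutkepohl's (2005, ch. 4) definitions with the degrees-of-freedom convention spelled out, since that is exactly where packages diverge. Counting only the p*K^2 lag coefficients as free parameters is one common choice; adding the K intercepts is another:]

```python
import numpy as np

def lutkepohl_ic(sigma_u_mle, nobs, neqs, lag_order):
    """Information criteria following Lutkepohl (2005, ch. 4).

    sigma_u_mle : (K, K) residual covariance without df correction.
    free counts only the lag coefficients (p * K**2); including the
    intercepts is an equally common convention, hence cross-package
    differences.
    """
    ld = np.linalg.slogdet(sigma_u_mle)[1]        # ln|Sigma_u|
    free = lag_order * neqs ** 2                  # p * K**2 free parameters
    aic = ld + 2.0 / nobs * free
    bic = ld + np.log(nobs) / nobs * free
    hqic = ld + 2.0 * np.log(np.log(nobs)) / nobs * free
    # FPE uses params per equation including the intercept: K*p + 1
    m = neqs * lag_order + 1.0
    fpe = ((nobs + m) / (nobs - m)) ** neqs * np.exp(ld)
    return aic, bic, hqic, fpe
```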

>
> Not sure what's the easiest way-- are you guys comfortable giving me
> push privileges on devel?
>

Fine by me. I think Josef has to do this.

> - Wes
>

josef...@gmail.com

Jan 30, 2011, 3:05:25 PM
to pystat...@googlegroups.com

I'm looking at the changes and was preparing for a merge. But I think
the way Skipper's var/ar code has been moved or copied around might
have lost all connection to the history.

Merging the new files is not a problem, but I would like to see how
far the history got disconnected before merging.

Maybe it's just because bzr disconnects one part of the history after a
file split.

Josef

Wes McKinney

Jan 30, 2011, 3:39:05 PM
to pystat...@googlegroups.com

The history's all there, I think, here is the bzr mv on the original var.py:

http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/revision/2031

I just pushed a few more commits actually, should be showing up nowish.

I don't know what effect the refactoring will have on the merge--
it probably will not be pretty. If you're willing to hand me the keys to
the car (push privileges), I'd be happy to handle the merge, and I'll be
racing to reconcile the results against the current code and get test
coverage.

I'll address Skipper's comments a bit later.

josef...@gmail.com

Jan 30, 2011, 4:13:00 PM
to pystat...@googlegroups.com

The problem is that annotate for both ar.py (because of an accidental
deletion) and alt.py (because of bzr) has no history prior to your
copying.

I don't think there is any problem with the code; I'm just a fan of
annotate. For scipy I like trac and svn, because annotate that follows
copies immediately gives me 10 or 15 years of history, which is very useful.

Josef

Wes McKinney

Jan 30, 2011, 4:36:51 PM
to pystat...@googlegroups.com

Ah... seems like a failing on bzr's part then. alt.py was always
intended to be entirely temporary. If you'd like to do the
restructuring more "cleanly", then please go ahead-- I can then base my
branch off of the clean history. Either way, just let me know.

Wes McKinney

Jan 30, 2011, 4:45:04 PM
to pystat...@googlegroups.com
On Sun, Jan 30, 2011 at 2:52 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Sun, Jan 30, 2011 at 10:29 AM, Wes McKinney <wesm...@gmail.com> wrote:
>> Returning to this now-- I'd like to merge my branch into 0.3-devel
>> before things get too out of hand. I've organized all the VAR code
>> into a module under tsa:
>>
>> http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/files/2034?file_id=var-20110121195501-8c6bdua3kvjkv1g4-1
>>
>
> A few scattered comments.
>
> The decorators are in the tools directory now but aren't imported as
> such in var/model.py.

I'll blast my pyc files and that will probably turn up things like this.

> Should we go ahead and make matplotlib a hard or soft dependency?
> Same with pandas.  This may have already been discussed.  Either is
> fine with me.

I feel like matplotlib should be a hard-ish dependency. pandas could
be a soft dependency-- I think there are a lot of benefits to having
higher-level data structures within statsmodels. Honestly, I could
envision "merging" pandas into statsmodels eventually (and it might
actually be a really good idea). Maybe that's a bit ambitious right
now. And only if we drop the "scikits" to make imports easier ;)

> Should we try to follow the var/model.py pattern elsewhere in the
> code?  It might be nice for organization if it's used consistently.

True. I didn't know what to call it. I think consistency is better so
whatever you guys think.

> What's e1.dat?  Can you do the examples with the macrodata dataset or
> include e1?

I'm basing all the code off of macrodata going forward-- but the
Lutkepohl data sets (e.g. e1.dat) are in tsa/var/data.

> Do you think you can pull some of the tests that are methods out of
> VAREstimator and have them available as functions?  If you want to
> attach them to the model, we talked about using mix-in classes, but I
> don't think we've implemented it in too many places yet.  If we do
> start the mix-in patterns, should we separate the tests from the
> results so that it's something like
> ModelResults.tests.test_causality() or something.  Also, I see that
> you do the actual fitting in VAREstimator.  For the most part, we have
> been doing the fitting in the model's fit method and then returning a
> ModelResults class that holds the results.  Do we want to stay
> consistent?

Seems fine to me, easy enough refactoring.

> I also keep a reference to the original model in a 'model' attribute
> in the results class.
>
>> I also moved the univariate AR model code out of tsa/var.py and into
>> tsa/ar.py-- can be moved elsewhere to your taste :) What's left of the
>
> That's fine.  I also think that yule_walker should be moved there too.
>
>> original var.py is in var/alt.py.
>>
>
> Did you reuse any of the code from VAR2 or VARMAResults?  You could
> get the loglikelihood from there.  Once the results are settled across
> classes, we can delete alt.py.

Yeah, I'll use the loglike function from there, reconcile the results,
and we can delete it.

Yeah, I've been following Lutkepohl pretty explicitly. So the results
*should* match up with the vars R package which is-- as far as I can
tell-- 100% based on Lutkepohl.

Wes McKinney

Jan 30, 2011, 7:42:27 PM
to pystat...@googlegroups.com

By the way, one problem with having a Results-only class versus an
estimation class is that it makes lazy evaluation of certain things
sort of difficult. But in this case it's not horribly onerous--
performance will be the main concern in the dynamic (time-varying
parameter) VAR class (yet to be coded).
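
[The lazy-evaluation concern has a fairly standard answer: compute-on-first-access attributes, in the spirit of the decorators module mentioned earlier in the thread. A from-scratch sketch of the idea, names hypothetical:]

```python
class cached(object):
    """Non-data descriptor: compute once on first access, then cache."""
    def __init__(self, func):
        self.func, self.name = func, func.__name__

    def __get__(self, obj, cls):
        if obj is None:
            return self
        value = self.func(obj)
        obj.__dict__[self.name] = value  # instance dict now shadows the descriptor
        return value

class ResultsSketch(object):
    def __init__(self, resid):
        self.resid = resid
        self.n_calls = 0

    @cached
    def sigma_u(self):
        self.n_calls += 1  # runs exactly once per instance
        return sum(e * e for e in self.resid) / len(self.resid)
```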

Skipper Seabold

Jan 30, 2011, 8:31:40 PM
to pystat...@googlegroups.com
On Sun, Jan 30, 2011 at 7:42 PM, Wes McKinney <wesm...@gmail.com> wrote:
> By the way, one problem with having a Results-only class versus an
> estimation class is that it makes lazy evaluation of certain things
> sort of difficult. But in this case it's not horribly onerous--
> performance will be the main concern in the dynamic (time-varying
> parameter) VAR class (yet to be coded).
>

Can you elaborate on this a little? I don't see why the model can't
serve as the estimator. How do you intend to do the estimation? I
have just been asked to write some code for a prof that does
time-varying parameters for a panel data model using the Kalman
filter, and I expect to do it much the same as the ARMA code, i.e.,
have fit call an optimization maximizing a likelihood computed via the
KF and then return a results class. Am I missing something?

Skipper
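The fit-by-maximizing-a-KF-likelihood pattern Skipper describes can be sketched for the simplest case, a univariate local-level model (an illustration only, not the panel TVP model itself; the function name is hypothetical):

```python
import numpy as np

def local_level_loglike(params, y):
    """Gaussian log-likelihood of a local-level model, computed by
    running the Kalman filter and summing the innovation densities.

        y_t = a_t + e_t,      e_t ~ N(0, s2_eps)
        a_t = a_{t-1} + w_t,  w_t ~ N(0, s2_eta)
    """
    s2_eps, s2_eta = np.exp(params)   # exp() keeps the variances positive
    a, P = y[0], 1e6                  # crude diffuse initialization
    llf = 0.0
    for t in range(1, len(y)):
        P = P + s2_eta                # predict the state variance
        v = y[t] - a                  # innovation
        F = P + s2_eps                # innovation variance
        llf -= 0.5 * (np.log(2 * np.pi * F) + v * v / F)
        K = P / F                     # Kalman gain
        a = a + K * v                 # update state mean
        P = (1 - K) * P               # update state variance
    return llf
```

A fit method would then just minimize the negative of this with e.g. scipy.optimize.fmin and wrap the optimum in a results class, exactly the ARMA-code pattern mentioned above.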

Wes McKinney

Jan 30, 2011, 10:56:51 PM1/30/11
to pystat...@googlegroups.com

As I continue to refactor I'm finding what I wrote to be less true-- a
couple calculations need to be repeated but otherwise there's not much
loss. I'm fine with it :)

I'm working on getting code coverage now. Fixed up the loglike
function that was in alt.py (it wasn't quite complete?), should be
able to jettison that module soon. Probably won't be able to get it
all done tonight-- holding off on uploading the test data (generated
by tsa/var/tests/var.R and piped into Python via some rpy2 stuff I
hacked together) to avoid bloating the bzr repo much further.

Skipper Seabold

Jan 30, 2011, 11:14:08 PM1/30/11
to pystat...@googlegroups.com

What was missing from the likelihood (in VAR2)? I get

In [5]: res.llf
Out[5]: 1962.5708240443246

And in R (from var.R)

> print(logLik(mod),digits=16)
'log Lik.' 1962.570824044324 (df=NULL)

Skipper
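For reference, the number both packages are computing here is the Gaussian VAR log-likelihood concentrated at the residual covariance MLE (Lutkepohl, ch. 3); a numpy sketch, with a hypothetical function name:

```python
import numpy as np

def var_loglike(resid, nobs):
    """Concentrated Gaussian log-likelihood of a fitted VAR.

    resid : (nobs, k) residual matrix.
    Uses sigma_u = resid' resid / nobs (the biased MLE), at which the
    quadratic-form term collapses to nobs * k.
    """
    k = resid.shape[1]
    sigma_u = np.dot(resid.T, resid) / nobs
    logdet = np.linalg.slogdet(sigma_u)[1]
    return -0.5 * nobs * (k * np.log(2 * np.pi) + logdet + k)
```

This is presumably the simpler formula referred to below: the trace term need never be evaluated explicitly.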

Wes McKinney

Jan 30, 2011, 11:29:32 PM1/30/11
to pystat...@googlegroups.com

Never mind-- you are quite right (haven't had enough coffee today). You
had a simpler formula actually-- I think it was some
degrees-of-freedom issue that kept it from matching up to R for me.

Here's the state of things at the moment, comparing with vars:

23:28 ~/code/statsmodels $ ./test_coverage.sh
test_var.TestIRF.test_plots ... ok
test_var.TestVARResults.test_acf ... ok
test_var.TestVARResults.test_acorr ... ok
test_var.TestVARResults.test_aic ... ok
test_var.TestVARResults.test_bic ... ok
test_var.TestVARResults.test_detsig ... ok
test_var.TestVARResults.test_fpe ... ok
test_var.TestVARResults.test_hqic ... ok
test_var.TestVARResults.test_irf_coefs ... ok
test_var.TestVARResults.test_is_stable ... ok
test_var.TestVARResults.test_loglike ... ok
test_var.TestVARResults.test_ma_rep ... ok
test_var.TestVARResults.test_nobs ... ok
test_var.TestVARResults.test_params ... ok
test_var.TestVARResults.test_plot_acorr ... ok
test_var.TestVARResults.test_plot_irf ... ok
test_var.TestVARResults.test_stderr ... ok

Name Stmts Miss Cover Missing
--------------------------------------------------------------------
scikits.statsmodels.tsa.var 0 0 100%
scikits.statsmodels.tsa.var.irf 137 93 32% 48, 51,
73-85, 97-111, 148-157, 170-188, 205-232, 241-251, 254, 259-278,
282-296, 299, 302
scikits.statsmodels.tsa.var.model 448 136 70% 98, 100,
128, 181-182, 190-207, 213-223, 226, 268-275, 281, 352-354, 405, 467,
503, 602, 626-629, 632-633, 676, 890-892, 901-902, 904-926, 934-938,
941, 968, 1000-1052, 1055, 1138, 1143, 1147, 1150-1153, 1162-1165,
1174, 1185-1187, 1190-1206, 1208-1220, 1225-1263
scikits.statsmodels.tsa.var.output 125 104 17% 64-65,
68, 71-90, 96-104, 107-128, 134-151, 154-179, 182-192, 195-214,
224-244, 248-275, 278-286
scikits.statsmodels.tsa.var.plotting 124 80 35% 33-48,
52-53, 57-72, 86-99, 133, 156-173, 176-192, 196-226
scikits.statsmodels.tsa.var.util 78 47 40% 24-29,
33, 38, 47-91, 123, 125, 132, 146-158
--------------------------------------------------------------------
TOTAL 912 460 50%
----------------------------------------------------------------------
Ran 17 tests in 1.554s

OK
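Several of the tests above (test_ma_rep, test_irf_coefs) exercise the MA(infinity) representation of a stable VAR(p), which comes from the recursion Phi_0 = I, Phi_i = sum_{j=1..min(i,p)} Phi_{i-j} A_j (Lutkepohl). A minimal sketch, independent of the statsmodels code:

```python
import numpy as np

def ma_rep(coefs, maxn=10):
    """MA(infinity) coefficient matrices of a stable VAR(p).

    coefs : (p, k, k) array of lag coefficient matrices A_1..A_p.
    Recursion: Phi_0 = I_k, Phi_i = sum_{j=1..min(i,p)} Phi_{i-j} A_j.
    The Phi_i are the (non-orthogonalized) impulse responses.
    """
    p, k, _ = coefs.shape
    phis = np.zeros((maxn + 1, k, k))
    phis[0] = np.eye(k)
    for i in range(1, maxn + 1):
        for j in range(1, min(i, p) + 1):
            phis[i] += np.dot(phis[i - j], coefs[j - 1])
    return phis
```

For a VAR(1) the recursion reduces to matrix powers, Phi_i = A_1**i, which makes a handy sanity check.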

josef...@gmail.com

Feb 1, 2011, 4:28:34 PM2/1/11
to pystat...@googlegroups.com
> restructuring my "cleanly" then please go ahead-- I can then base my
> branch off of the clean history. Whichever way, just let me know.
>

I did a selective merge on your changeset 2031 that deleted the AR
class, skipped 2032, and merged the rest. All files are the same as in
your branch, and when I tried in my copy of your branch, I was able to
merge back without any merge conflicts.
So I think it's safe enough to push this
lp:~josef-pktd/statsmodels/statsmodels-devel-tmp into devel.

One more question.

changeset
http://bazaar.launchpad.net/~wesmckinn/statsmodels/statsmodels-wesm/revision/2039
deleted all of Skipper's VAR classes when
scikits/statsmodels/tsa/var/alt.py was removed.

Is this all obsolete with the new VAR classes, or do we need to keep
(parts of) it?
(The module doesn't have a connection to its history, so it can also
be recovered from the repository later on without any loss.)

Josef

Wes McKinney

Feb 1, 2011, 5:22:42 PM2/1/11
to pystat...@googlegroups.com

Yes-- I had a good look through and I believe it was all obsolete. Can
always be recovered from history if I missed something.

Wes McKinney

Feb 2, 2011, 9:41:18 PM2/2/11
to pystat...@googlegroups.com

If you think it's safe to push to devel, do you want to go ahead and
do it? I can then pull from devel and overwrite my branch before I
make any further developments.

- Wes

josef...@gmail.com

Feb 2, 2011, 11:06:32 PM2/2/11
to pystat...@googlegroups.com

Sorry for the late reply.

I'm giving up on trying to manipulate the bzr history. I will merge
your branch just the way it is.
I had found another problem and don't have the time or patience to
figure out whether it's possible to do what I wanted. We just lose
the annotate for one class, and if we need the full history, we can
always jump back in the log.

So you can just continue to work in your branch, and I will do the
merges (VAR and docs branch) tomorrow afternoon (after my teaching),
and then go over the related changes in the sphinx docs.

Josef

>
> - Wes
>

josef...@gmail.com

Feb 10, 2011, 8:07:49 PM2/10/11
to pystat...@googlegroups.com

Hi Wes,

Thanks for VAR.

I would like to get back to cleaning up, mainly docs and examples, tomorrow and
part of the weekend.

In the merge of your VAR I found that one file is missing,
resultspath + 'vars_results.npz'
when I was running the test suite.

  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 116, in __init__
    data = np.load(resultspath + 'vars_results.npz')
  File "C:\Programs\Python25\Lib\site-packages\numpy\lib\npyio.py", line 320, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: 'C:\\Josef\\eclipsegworkspace\\statsmodels-devel3\\scikits\\statsmodels/tsa/var/tests/results/vars_results.npz'


Josef

>
> Josef
>
>>
>> - Wes
>>
>

Wes McKinney

Feb 10, 2011, 9:48:58 PM2/10/11
to pystat...@googlegroups.com
>
> Hi Wes,
>
> Thanks for VAR.
>
> I would like to get back to cleaning up, mainly docs and examples, tomorrow and
> part of the weekend.
>
> In the merge of your VAR  I found that one file is missing,
> resultspath + 'vars_results.npz'
> when I was running the test suite.
>
>   File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 116, in __init__
>     data = np.load(resultspath + 'vars_results.npz')
>   File "C:\Programs\Python25\Lib\site-packages\numpy\lib\npyio.py", line 320, in load
>     fid = open(file, "rb")
> IOError: [Errno 2] No such file or directory: 'C:\\Josef\\eclipsegworkspace\\statsmodels-devel3\\scikits\\statsmodels/tsa/var/tests/results/vars_results.npz'
>
>
> Josef
>
>>
>> Josef
>>
>>>
>>> - Wes
>>>
>>
>

<snip>

Thanks for pointing that out-- I was hesitant to check in that
(binary) file while I was working on the test suite, to avoid bloating
the repository any further. I can add it for you right now, or I was
planning to wrap up testing this weekend and add anything additional
that's needed, whichever you prefer. Then I can move forward with
implementing more VAR stuff-- probably focusing on dynamic
(time-varying) VAR for the immediate time being (this will eventually
dovetail with some Bayesian work on dynamic models I'll be doing in
the coming months too).

- Wes

Wes McKinney

Feb 13, 2011, 12:35:05 AM2/13/11
to pystat...@googlegroups.com

I just pushed a bunch of revisions to my branch along with the VAR
results file. I have about 100% line coverage on the current code--
I'll keep adding in more result tests as they come along. The basics
are all set.

I got a laugh out of the attached screenshot...kind of amazing to me
in the age of GitHub :)

Cheers,
Wes

bzr.png

josef...@gmail.com

Feb 13, 2011, 9:00:06 AM2/13/11
to pystat...@googlegroups.com

Great-- 100% makes it the highest coverage of any part of statsmodels.

I will merge it in the next few hours into devel.

>
> I got a laugh out of the attached screenshot...kind of amazing to me
> in the age of GitHub :)

(offtopic for this thread)

While python overall seems to have converged to hg, scientific python
looks settled in on github (and git as a consequence?)

I recently read some comments

http://lusislog.blogspot.com/2010/10/designed-for-developers-why-people-keep.html#disqus_thread
http://www.reddit.com/r/programming/comments/emhu3/why_people_keep_asking_you_to_use_github/

Eventually or soon we will catch up with some "network externalities"

Josef
>
> Cheers,
> Wes
>

josef...@gmail.com

Feb 13, 2011, 10:07:08 AM2/13/11
to pystat...@googlegroups.com

(even more offtopic)

Apropos network externalities: I think I need to switch to pure R

Windows' slide in market share continues; it now has only *almost* 90%
http://www.networkworld.com/community/blog/windows-drops-below-90-market-share

Statistics opensource is all in R, that's where the community and also
where the money is
http://www.theregister.co.uk/2011/02/07/revolution_r_sas_challenge/

rpy2 doesn't work on Windows

So the only conclusion is that I have to switch to using R

Cheers,

Josef

>
> Josef
>>
>> Cheers,
>> Wes
>>
>

Wes McKinney

Feb 13, 2011, 10:48:28 AM2/13/11
to pystat...@googlegroups.com

The GitHub siren is calling... it really does make a big difference in
collaboration with groups of people.

>
> (even more offtopic)
>
> Apropos network externalities: I think I need to switch to pure R
>
> Windows' slide in market share continues; it now has only *almost* 90%
> http://www.networkworld.com/community/blog/windows-drops-below-90-market-share
>
> Statistics opensource is all in R, that's where the community and also
> where the money is
> http://www.theregister.co.uk/2011/02/07/revolution_r_sas_challenge/
>
> rpy2 doesn't work on Windows
>
> So the only conclusion is that I have to switch to using R
>
> Cheers,
>
> Josef
>
>>
>> Josef
>>>
>>> Cheers,
>>> Wes
>>>
>>
>

Or stop using Windows. The sooner the better IMHO =)

Indeed you have companies like Revolution Analytics wanting to carve
out a piece of the statistics business for themselves to the tune of
$1000/workstation (quite a lot less than what SAS charges). Since
academic statistics programs are R-heavy, users leaving PhD programs
all tend to be proficient in R.

But we're trying to change that, right? R is not a very good
language-- Python is! You can't build serious production systems in
R-- integrating R with the outside world is really quite painful
(though it appears that Rev. Analytics is working to solve this
problem...but you're still stuck with the R language). So we have to
build a compelling set of Python tools for doing statistics and,
hopefully, they will come.

Skipper Seabold

Feb 13, 2011, 10:54:20 AM2/13/11
to pystat...@googlegroups.com
On Sun, Feb 13, 2011 at 10:48 AM, Wes McKinney <wesm...@gmail.com> wrote:
> The GitHub siren is calling... it really does make a big difference in
> collaboration with groups of people.
>

Since we're going OT,

I've gotten fairly comfortable with it lately
<https://github.com/jseabold/pymaclab>, and I'm convinced and ready to
move. First we should discuss the next release... Then I think we can
start the move. Things have been hectic around here, but I think I
could devote some time to both activities in the coming weeks.

Skipper

Wes McKinney

Feb 13, 2011, 11:45:10 AM2/13/11
to pystat...@googlegroups.com

Sure, maybe let's start a new thread re: 0.3 discussion. What needs to
happen for the 0.3 release? Can I help in any way?

josef...@gmail.com

Feb 13, 2011, 12:14:51 PM2/13/11
to pystat...@googlegroups.com

I will write a summary as I see it after lunch.

Josef

josef...@gmail.com

Feb 13, 2011, 2:28:14 PM2/13/11
to pystat...@googlegroups.com

after merging into devel I get some errors and failures
The matplotlib errors are because I have 0.99.1 and these are features
for 1.0 or higher. I haven't looked at the others yet

Josef

======================================================================
ERROR: test_var.TestVARResults.test_causality
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 328, in test_causality
    result = self.res.test_causality(name, variables, kind='f')
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\model.py", line 922, in test_causality
    eq_index = self.get_eq_index(equation)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\model.py", line 449, in get_eq_index
    return util.get_index(self.names, name)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\util.py", line 165, in get_index
    result = lst.index(name)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_var.TestVARResults.test_fevd_plot
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 205, in test_fevd_plot
    self.fevd.plot()
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\model.py", line 1104, in plot
    fig, axes = plt.subplots(nrows=k, figsize=(10,10))
AttributeError: 'module' object has no attribute 'subplots'

======================================================================
ERROR: test_var.TestVARResults.test_get_eq_index
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 250, in test_get_eq_index
    idx2 = self.res.get_eq_index(name)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\model.py", line 449, in get_eq_index
    return util.get_index(self.names, name)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\util.py", line 165, in get_index
    result = lst.index(name)
AttributeError: 'tuple' object has no attribute 'index'

======================================================================
ERROR: test_var.TestVARResults.test_lagorder_select
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 304, in test_lagorder_select
    with assert_raises(Exception):
TypeError: failUnlessRaises() takes at least 3 arguments (2 given)

======================================================================
ERROR: test_var.TestVARResults.test_plot_cum_effects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 189, in test_plot_cum_effects
    self.irf.plot_cum_effects()
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\irf.py", line 120, in plot_cum_effects
    plot_params=plot_params)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\plotting.py", line 160, in irf_grid_plot
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True,
AttributeError: 'module' object has no attribute 'subplots'

======================================================================
ERROR: test_var.TestVARResults.test_plot_irf
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\tests\test_var.py", line 177, in test_plot_irf
    self.irf.plot()
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\irf.py", line 91, in plot
    plot_params=plot_params)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\var\plotting.py", line 160, in irf_grid_plot
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex=True,
AttributeError: 'module' object has no attribute 'subplots'

======================================================================
ERROR: scikits.statsmodels.tsa.tests.test_tsa_tools.test_duplication_matrix
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\programs\python25\lib\site-packages\nose-0.11.1-py2.5.egg\nose\case.py", line 183, in runTest
    self.test(*self.arg)
  File "C:\Josef\eclipsegworkspace\statsmodels-devel3\scikits\statsmodels\tsa\tests\test_tsa_tools.py", line 56, in test_duplication_matrix
    assert(np.array_equal(vec(m), np.dot(D3, vech(m))))
ValueError: matrices are not aligned

----------------------------------------------------------------------
Ran 1172 tests in 290.063s

FAILED (SKIP=12, errors=7)
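The recurring AttributeError above is a Python 2.5 issue: tuple.index only appeared in Python 2.6, so util.get_index breaks when the model's names are stored as a tuple. A hedged sketch of a sequence-agnostic version (the actual fix in the branch may look different):

```python
def get_index(names, name):
    """Position of an equation name in `names`, whether `names` is a
    list, a tuple, or any other sequence.  Converting to a list first
    sidesteps the missing tuple.index on Python 2.5."""
    try:
        return list(names).index(name)
    except ValueError:
        raise Exception("%r is not a recognized equation name" % name)
```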

Wes McKinney

Feb 13, 2011, 3:28:12 PM2/13/11
to pystat...@googlegroups.com

I think it's safe to require matplotlib >= 1.0 now; EPD has had it for
multiple releases (I recommend upgrading to the new EPD with Python
2.7, by the way).

I'll take a look at the other test failure

Wes McKinney

Feb 13, 2011, 3:30:40 PM2/13/11
to pystat...@googlegroups.com

I should also point out that the recent EPD release takes care of the
concern I expressed over scipy.optimize failures a few weeks ago.

Wes McKinney

Feb 13, 2011, 4:29:49 PM2/13/11
to pystat...@googlegroups.com

Fixed a bug in the duplication_matrix function and made the unit test
a bit better, so that test should not fail now. Getting matplotlib 1.0
and a newer version of nose should fix the remaining errors you're
having.
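The identity behind test_duplication_matrix is vec(A) = D_n vech(A) for symmetric A (see the appendix on the vec operator in Lutkepohl). A self-contained construction, independent of the statsmodels implementation, that the fixed test should agree with:

```python
import numpy as np

def vech(A):
    """Stack the on-and-below-diagonal elements of A, column by column."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def duplication_matrix(n):
    """The n^2 x n(n+1)/2 matrix D_n with vec(A) = D_n @ vech(A)
    for any symmetric n x n matrix A (vec is column-major)."""
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    k = 0
    for j in range(n):            # column of A
        for i in range(j, n):     # rows on/below the diagonal
            D[j * n + i, k] = 1   # position of A[i, j] in vec(A)
            if i != j:
                D[i * n + j, k] = 1  # its symmetric partner A[j, i]
            k += 1
    return D
```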

josef...@gmail.com

Feb 13, 2011, 5:40:48 PM2/13/11
to pystat...@googlegroups.com

that's good

>>
>
> Fixed a bug in the duplication_matrix function and made the unit test
> a bit better, so that test should not fail now. Getting matplotlib 1.0
> and a newer version of nose should fix the remaining errors you're
> having.

I'm not using EPD or any other premade distribution, and I'm always at
the tail end of any updates.
Upgrading nose shouldn't be a problem, but I think we should make all
matplotlib tests conditional on the presence of the right matplotlib.
skipif similar to what we did at some point with rpy.

I don't know what the updating for various Linux distributions is, but
John Hunter recently mentioned that 1.x has not yet fully penetrated
the market.

I think matplotlib should be optional, and as lazily imported as
possible. It's another heavy import that we don't always want. (But a
fully loaded statsmodels is starting to look pretty good for
interactive work.)

Josef
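Josef's skipif idea can be sketched as a small guard that each plotting test calls before touching matplotlib. unittest.SkipTest (Python >= 2.7) is used here since nose also honors it; all names are illustrative, not the statsmodels API:

```python
import unittest

def version_at_least(vstring, minimum=(1, 0)):
    """Compare a dotted version string like '0.99.1' against a
    (major, minor) tuple, ignoring non-numeric suffixes like 'rc1'."""
    parts = []
    for piece in vstring.split('.')[:2]:
        digits = ''.join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or '0'))
    return tuple(parts) >= minimum

def require_matplotlib(minimum=(1, 0)):
    """Skip the calling test unless a new-enough matplotlib imports."""
    try:
        import matplotlib
    except ImportError:
        raise unittest.SkipTest("matplotlib is not installed")
    if not version_at_least(matplotlib.__version__, minimum):
        raise unittest.SkipTest("matplotlib >= %d.%d required" % minimum)
```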

Wes McKinney

Feb 13, 2011, 7:34:39 PM2/13/11
to pystat...@googlegroups.com

I think we should be encouraging people to use canned distributions
like EPD, *especially* on Windows where EPD is rigged up to build C
extensions with no hassle. I recommend you try it-- it's free to
academics like yourself ;) Even on Linux using EPD makes life
significantly more pleasant-- there's always Python(x,y) for
non-academic users who don't want to pay for EPD (but no 64-bit build
for Windows).

matplotlib's import time, fortunately, seems immaterial compared
with scipy's:

In [1]: %time import scipy.stats
CPU times: user 0.35 s, sys: 0.02 s, total: 0.37 s
Wall time: 0.38 s

In [2]: %time import matplotlib
CPU times: user 0.07 s, sys: 0.02 s, total: 0.09 s
Wall time: 0.08 s

I'm raising SkipTest for users who don't have matplotlib. It seems
silly to me to not target >= 1.x in our current developments.

> I don't know what the updating for various Linux distributions is, but
> John Hunter recently mentioned that 1.x has not yet fully penetrated
> the market.
>
> I think matplotlib should be optional, and as lazily imported as
> possible. It's another heavy import that we don't always want. (But a
> fully loaded statsmodels is starting to look pretty good for
> interactive work.)

I think long-term we should be pushing a statsmodels-based and
{charlton, pandas, etc. etc.}-based interactive environment as an
alternative to R and other statistical environments out there. I'm all
for making statsmodels a great library on its own, but the real
benefit will be the integrated research environment for data analysis.
I wouldn't actually be averse to "merging" pandas into statsmodels if
it seemed like the right thing to do. Since it seems like some kind of
final resolution on data structures isn't happening anytime soon, it
seems like we should take a working, robust solution (which may be
imperfect or not solve every problem under the sun) like pandas and
use it-- it's certainly not stopping a whole lot of other people out
there in industry from doing exactly that.

> Josef
>

josef...@gmail.com

Feb 13, 2011, 9:05:19 PM2/13/11
to pystat...@googlegroups.com

I don't think I would have counted as academic last year. I like my
personalized python, but I agree with the general recommendation of
distributions.

> matplotlib import time is fortunately comparatively immaterial it
> seems compared with scipy:
>
> In [1]: %time import scipy.stats
> CPU times: user 0.35 s, sys: 0.02 s, total: 0.37 s
> Wall time: 0.38 s

I'm very happy about this :(

>
> In [2]: %time import matplotlib
> CPU times: user 0.07 s, sys: 0.02 s, total: 0.09 s
> Wall time: 0.08 s

If this doesn't have matplotlib preloaded, then it's a lot faster now.
To me it feels more like seconds.


>
> I'm raising SkipTest for users who don't have matplotlib. It seems
> silly to me to not target >= 1.x in our current developments.

That's fine with me. I will upgrade soon, it just didn't make a
difference to me and by default I don't upgrade.

>
>> I don't know what the updating for various Linux distributions is, but
>> John Hunter recently mentioned that 1.x has not yet fully penetrated
>> the market.
>>
>> I think matplotlib should be optional, and as lazily imported as
>> possible. It's another heavy import that we don't always want. (But a
>> fully loaded statsmodels is starting to look pretty good for
>> interactive work.)
>
> I think long-term we should be pushing a statsmodels-based and
> {charlton, pandas, etc. etc.}-based interactive environment as an
> alternative to R and other statistical environments out there. I'm all
> for making statsmodels a great library on its own, but the real
> benefit will be the integrated research environment for data analysis.
> I wouldn't actually be averse to "merging" pandas into statsmodels if
> it seemed like the right thing to do. Since it seems like some kind of
> final resolution on data structures isn't happening anytime soon, it
> seems like we should take a working, robust solution (which may be
> imperfect or not solve every problem under the sun) like pandas and
> use it-- it's certainly not stopping a whole lot of other people out
> there in industry from doing exactly that.

I think statsmodels as a backend for web applications is another
possible big area. The institute I was in last year switched to django
and numpy for their latest web application. There, minimal load is an
advantage, and I guess it will remain this way.

I translated a GAUSS program (by some friends, econometric theorists)
for tests of Markov switching models to numpy and statsmodels.
Interactive work with matplotlib and statsmodels works very well, and
there were only a few pieces that I was missing. Although doing Monte
Carlos on Markov switching models without cython is slow.

It's also a future I would like to see, but a package this big is
still a lot of work to build and then to maintain. The new
statsmodels is not even two years old, and we still have a long way to
go until statsmodels 1.0, with lots of design decisions and
refactorings ahead of us.

But I hope we can keep a dual purpose, interactive and library,
without many sacrifices either way

Josef

>
>> Josef
>>
>

Wes McKinney

Feb 13, 2011, 9:43:34 PM2/13/11
to pystat...@googlegroups.com

I think the key will be getting more and more people involved-- I hope
that at minimum Skipper and I will be able to give a couple of talks
at SciPy on our progress and try to work towards enlisting the help of
others.

As far as a dual purpose library-- as long as we use the api.py format
for the "interactive" bit and are careful about not importing things
unnecessarily, I don't think that will be very difficult to achieve. I
do think that having statsmodels be something that can be used in
performance-sensitive production applications will be quite important!

Something we can discuss more in the coming months.
