upfront
I just realized that I never replied at the end of this thread. This
was the beginning of my reply
------------
Late response. I didn't want to send my first response and then got
busy with the first week of classes and needing to read up on GMM again.
I largely agree with the description of the problem. We need more
maintainers of subpackages and addons.
If supporting and advertising separate addons helps, I'm all for it,
especially in the long run.
This is essentially the same reason as for the introduction of scikits:
to support methods that cannot be added (yet) to scipy.
I don't think statsmodels will get an installer like R or Stata, since
there are the usual python channels and package distributors.
scipy-central might also become a good place after the current rewrite
and extensions.
But we would be able to help with search and advertising.
-----------------
and then I got distracted again. We had some follow-up threads that
were discussing related topics.
My main answers to this are: I don't know, and it needs a champion.
Building a new infrastructure for related packages requires work or a
template. I haven't seen anything like that from pandas or
scikit-learn, which also had this discussion, although I didn't search
for any details.
This will only happen if someone comes up with a feasible
infrastructure and someone implements it.
(I maintained a "related packages" list in the statsmodels
documentation for a while early on in the life of statsmodels, but it
was eventually deleted because it was outdated and nobody kept it up
to date.)
If there is work in this area, then statsmodels can support it in
terms of adjustment to the infrastructure, documentation and so on.
I'm pointing users on mailing lists or stackoverflow to other packages
that I know of where statsmodels doesn't cover the functionality yet,
especially lmfit for nonlinear least squares, and Kevin's package
for GARCH.
There are also various scikits on specific topics, e.g. bootstrap;
bootstrap is also available in Kevin's GARCH package.
My personal view is still to try to integrate things into statsmodels.
One issue in the current python package distribution system is that
circular dependencies across packages cause problems in many
distribution channels. We had this for a while with pandas
(statsmodels is an optional dependency of pandas).
The trend in my refactoring of statsmodels in the last year has been
much more focused on tighter control and consistency of subpackages.
For the core models outside of tsa (linear models, discrete models,
GLM and RLM) this goes in the opposite direction of separating
packages. It means that we can write new functionality (more
statistical tests, imputation, and similar) in a generic way that
works and can be immediately tested for several model categories.
The separation and modularization of the components of a model, that
Vincent pointed out, is independent of this, but refactoring in that
area is very slow.
Three examples:
Robust covariance matrices were added in a set of generic functions.
Models that provide `score_obs` and `hessian` or something equivalent can
immediately take advantage of it. I integrated it in the `fit` method
as in Stata. R and Julia users have to look for the corresponding
"sandwich" packages which are not integrated with the models directly
(although both R and Julia allow dispatch).
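To illustrate what "generic" means here, a minimal sketch in plain Python follows. It is not the statsmodels implementation: `sandwich_cov`, `ToyLinearModel` and the small matrix helpers are hypothetical names made up for this example, and the only assumption about a model is that it exposes `score_obs(params)` and `hessian(params)`.

```python
# Illustrative sketch only; all names here are hypothetical, not statsmodels code.

def mat_mul(a, b):
    """Product of two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_inv(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sandwich_cov(model, params):
    """H^{-1} (sum_i s_i s_i') H^{-1}, using only score_obs and hessian."""
    scores = model.score_obs(params)  # one row per observation
    k = len(params)
    meat = [[sum(s[i] * s[j] for s in scores) for j in range(k)]
            for i in range(k)]
    bread = mat_inv(model.hessian(params))
    return mat_mul(mat_mul(bread, meat), bread)


class ToyLinearModel:
    """Stand-in model with objective -0.5 * sum((y - x'b)^2)."""

    def __init__(self, y, x):
        self.y, self.x = y, x

    def score_obs(self, params):
        rows = []
        for yi, xi in zip(self.y, self.x):
            resid = yi - sum(a * b for a, b in zip(xi, params))
            rows.append([resid * a for a in xi])
        return rows

    def hessian(self, params):
        k = len(params)
        return [[-sum(xi[i] * xi[j] for xi in self.x) for j in range(k)]
                for i in range(k)]


# Intercept-only example: params is the sample mean of y.
model = ToyLinearModel([1.0, 2.0, 3.0, 6.0], [[1.0], [1.0], [1.0], [1.0]])
cov = sandwich_cov(model, [3.0])
```

Any other model class that provides the same two methods could be passed to `sandwich_cov` unchanged, which is the point of keeping the function generic. For this least-squares toy model the result is the heteroscedasticity-robust covariance without a small-sample correction.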
Chad's new statespace models and Kalman filter can be reused for
regression models directly, without installing a separate statespace
package.
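As a rough sketch of why a Kalman filter is reusable for regression: if the regression coefficient is treated as a constant state with a very vague prior, the filter's update steps reproduce the least-squares estimate. The code below is a scalar toy version in plain Python, not Chad's implementation or the statsmodels API; `kalman_regression`, `obs_var` and `prior_var` are hypothetical names chosen for this example.

```python
# Illustrative sketch only; a scalar Kalman filter for y_t = x_t * beta + e_t
# with a constant state beta. Names and defaults are hypothetical.

def kalman_regression(y, x, obs_var=1.0, prior_var=1e6):
    """Return the filtered estimate of beta and its variance."""
    beta, p = 0.0, prior_var  # vague prior centered at zero
    for yt, xt in zip(y, x):
        forecast_error = yt - xt * beta
        forecast_var = xt * p * xt + obs_var
        gain = p * xt / forecast_var
        beta = beta + gain * forecast_error  # state update
        p = p - gain * xt * p                # variance update
    return beta, p


x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
beta_kf, var_kf = kalman_regression(y, x)

# Least-squares estimate for a regression through the origin, for comparison.
beta_ols = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
```

With a large `prior_var` the filtered `beta_kf` agrees with `beta_ols` up to the small shrinkage caused by the prior.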
Kerby's GEE reused our GLM families and link functions, and in turn we
can reuse the additional covariance structures that GEE brought with it
(although that hasn't happened yet). Kerby is also pushing other code,
like prediction and plot functions, into statsmodels, so that it can be
reused across model categories instead of being written for a specific
category of models in "his" packages.
That's my focus and priorities. It's not exclusive, but it's where
**my** time goes.
My overall vision is still the Stata pattern with a core package that
can include and merge contributed packages. (Stata documentation often
has a comment at the end that the function is based on, or was
originally written as, a user package.)
I also think we should get more attribution of code to the authors in
statsmodels. We never discussed a policy for this and it is as
unsystematic as in scipy. One problem is that some older modules have
been worked on by many contributors, so it's difficult to assign
authorship. Other modules are still dominated by one author. The
second is that I don't think it will increase the involvement of the
original authors by a large amount. One typical pattern for
contributions is that they are written by a PhD student who has two
years available for statsmodels before hitting "real life", which
limits any future maintenance time. Better attribution can and should
still happen independently of this.
Finally,
Someone just needs to implement these ideas.
We are getting a good amount of new contributions, and bugfixes and
improvements in current code.
However, Skipper got a job that leaves him very little time for
statsmodels development and maintenance. Kevin contributed several
important infrastructure and continuous integration improvements,
among them the conversion to a common py2/py3 codebase (besides his
contribution to the model code).
But, there is still a gap. Also, Skipper was the only or main
developer of the interface with pandas and the data handling and the
interface with patsy's formulas.
I like to contribute to and design statsmodels, review code, do Q&A
and spend my time understanding the models. But I'm a mediocre
community organizer, and if I look at statsmodels and see mostly
maintenance and organizational or infrastructure tasks, then it is
difficult to stay motivated to spend most of my available time on it.
Help Wanted.
PS:
(In spite of the slowish pace, and PRs that have to wait too long for a merge.)
I think statsmodels is great. And we are going in a good direction,
both in terms of adding new models and in terms of adding all the
bells and whistles to the existing models (prefix and postestimation
commands in Stata terminology).
Josef