R vs. Python+statsmodels


Warren Weckesser

Aug 25, 2013, 11:16:17 PM
to pystat...@googlegroups.com
(Sorry for the spam if it's old news.)

Warren

josef...@gmail.com

Aug 25, 2013, 11:32:49 PM
to pystatsmodels
Thanks for the link. I haven't seen this before.


I hope we get our vbench and we can also start to cut some slack.

Josef


josef...@gmail.com

Aug 27, 2013, 9:54:09 AM
to pystatsmodels
We still have large gaps.

Josef



Vincent Arel

Aug 27, 2013, 10:04:14 AM
to pystat...@googlegroups.com
And 2 developers can't fill all those gaps. SM badly needs a package/library system that allows users to easily leverage the formula/data-handling/optimizing/summary-printing infrastructure that SM has in place. :)

Perhaps one day I'll do a write-up of my experience porting SM's quantreg code to a Julia package. I don't think anyone has actually used my package, but it was a breeze to put together and it's "useable".


Vincent

josef...@gmail.com

Aug 27, 2013, 10:18:24 AM
to pystatsmodels
On Tue, Aug 27, 2013 at 10:04 AM, Vincent Arel <vincen...@gmail.com> wrote:
> And 2 developers can't fill all those gaps. SM badly needs a package/library system that allows users to easily leverage the formula/data-handling/optimizing/summary-printing infrastructure that SM has in place. :)

What do you have in mind? Everyone can write a package that reuses code in statsmodels, the same as we do when we add a new model to statsmodels.

We are not just 2 developers filling the gaps. You wrote quantreg and most of panel. We have the GSOC project to add new packages. GEE will be a big new class of models in 0.6, and so on.

A bottleneck is integrating new code into statsmodels, and building up the infrastructure code for new model categories.


> Perhaps one day I'll do a write-up of my experience porting SM's quantreg code to a Julia package. I don't think anyone has actually used my package, but it was a breeze to put together and it's "useable".



Julia ??

Josef

Vincent Arel

Aug 27, 2013, 11:15:10 AM
to pystat...@googlegroups.com
Just brainstorming here...

I think there is much that could be learned from the CRAN model and, to a lesser extent, from Julia's attempt to reimagine libraries via github packaging.

In my view, there are three main issues with the current setup. 

The first is the one you point out: the bottleneck with integrating contributed code into statsmodels. There's no getting around the fact that you guys have limited time to work on that side. An easy way to plug in and load external contributions so that they work seamlessly would be an advantage here, because new contributions wouldn't necessarily have to be merged into the main codebase immediately to be useful.

The second problem is that as you get more contributions from the community, the code base grows and maintenance comes to occupy a lot of your time. I think the psychology of the process goes like this: here's a shiny new model I've implemented, I will collaborate with Josef and Skipper to get it integrated. But once it's there, I move on and stop thinking about it, forcing the SM maintainers to basically adopt my code and take care of it indefinitely. My experience with a couple of very simple R packages suggests that even if I don't really use them anymore, I still feel a sense of ownership and I keep maintaining and improving them as I get bug reports and suggestions. This is kinda sneaky, but I think that in the end it's a desirable feature of the CRAN / Julia packaging setup.

Finally, I think the barrier to entry for contributions is too high. It took me a good while to figure out the inheritance scheme and to know where to go from there. And often, the available classes have lots of attributes and a complicated structure. So if you want to inherit from RegressionModel, for example, you'll have to fill or silence many, many, many things. This rules out many simple but useful contributions.

For example, not long ago I wanted to give a demo to someone and was looking for a quick 2SLS implementation. There seem to be many things in the sandbox for this, but I couldn't quite figure out what was working and what wasn't, so I thought I would just code up a quick one and then do a PR so that there's at least a simple 2SLS available. But I quickly gave up when thinking about all the details I'd have to take care of. In contrast, if you look at the tsls() function from the sem package in R, you'll see that it's only about 30 lines of code. Sure, it's very basic and lacking diagnostics and such. But (A) it works, and (B) it's available. Allowing contributors to write up plugins could give users access to many of the same kinds of "half-baked-but-functional" stuff that's on CRAN right now.
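
(To give a sense of scale, here is the kind of bare-bones numpy sketch I have in mind. The function name and signature are made up, there are no diagnostics, and this isn't statsmodels code, just the estimator:)

import numpy as np

def tsls(y, X, Z):
    # Bare-bones 2SLS: y is (n,), X is (n, k) regressors, Z is (n, m) instruments, m >= k.
    # First stage: fitted values of X from regressing each column of X on Z.
    X_hat = np.dot(Z, np.linalg.lstsq(Z, X)[0])
    # Second stage: regress y on the fitted values.
    beta = np.linalg.lstsq(X_hat, y)[0]
    # Residuals use the original X, not the fitted values.
    resid = y - np.dot(X, beta)
    sigma2 = np.dot(resid, resid) / (len(y) - X.shape[1])
    cov_beta = sigma2 * np.linalg.inv(np.dot(X_hat.T, X_hat))
    return beta, cov_beta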

Skipper had mentioned on the list a while ago something about a plugin system. I don't know what he had in mind, but I don't think it needs to be very complicated. Just a well documented set of minimalist classes from which one can inherit, a clear way to interact with the formula and missing data functionality, and an easy way to create pretty printed summaries. Couple that with a standard package structure, and some mechanism for hosting and loading the plugins, and I think we're in business.

Vincent

PS: Yeah, Julia. Just wanted to know what all the fuss was about and thought quantreg would be a good way to learn. Spent only an afternoon on it.

Vincent Arel

Aug 27, 2013, 11:16:41 AM
to pystat...@googlegroups.com
Also, you write "Everyone can write a package that reuses code in statsmodels, the same as we do when we add a new model to statsmodels."

But the fact that they don't, I think, is telling.

Vincent 

Nathaniel Smith

Aug 27, 2013, 12:21:44 PM
to pystatsmodels

On 27 Aug 2013 16:17, "Vincent Arel" <vincen...@gmail.com> wrote:
> Also, you write "Everyone can write a package that reuses code in statsmodels, the same as we do when we add a new model to statsmodels."
>
> But the fact that they don't, I think, is telling.

One thing we obviously need is a good tutorial doc walking through how to implement a good stat model in python and release it as an independent package. What else?

-n

Vincent Arel

Aug 27, 2013, 9:24:26 PM
to pystat...@googlegroups.com
Yes, documentation for sure. Again, just brainstorming here, but some other ideas might be:

1) A centralized index of contributed packages and functions that allows search and download from within statsmodels. People may not want to produce packages if they'll just get lost on github somewhere. A centralized index hosted by SM would provide exposure to these other projects.

2) A standalone function that accepts a formula and a dataframe and returns an object with the basic statsmodels attributes, appropriately named (e.g. exog and endog numpy arrays on which the programmer will operate). We need a well documented and easy single point of entry for formula and (missing?) data handling (rough sketch at the end of this message).

3) There were some thoughts a little while ago about pulling some of the objective function optimization code out of LikelihoodModel and putting it inside a more general optimization class that could be used as a mixin. This could have some sensible defaults for numerical derivatives as in GenericLikelihoodModel. I don't know if it's a good idea, but it sounds like this might be useful for developers.

4) A choice of minimalist model or stattest classes with division of labor by use cases. Don't force users who are building regression models to produce AIC. Docs could be "George wants to do this... so he picks this class".

5) Some way to help users produce printed summaries more easily

Mostly, I'm trying to think of things that would provide exposure and ease access to external packages, and things that would expose some of the great and powerful existing SM infrastructure in a way that's easier to grasp for developers who don't want to invest a whole lot.
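
For point 2, here is a rough sketch of the kind of single entry point I'm imagining, just wrapping patsy directly; the helper name is made up:

import patsy

def design_from_formula(formula, data):
    # Hypothetical single entry point: formula + dataframe -> endog, exog.
    # patsy drops rows with missing values by default (NA_action="drop").
    endog, exog = patsy.dmatrices(formula, data, return_type="dataframe")
    return endog, exog

# usage: endog, exog = design_from_formula("y ~ x1 + x2", df)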

Vincent

Jl24

Apr 15, 2015, 11:55:46 PM
to pystat...@googlegroups.com
Bump. I independently came to a similar conclusion. SM needs modularity if it is to grow faster.

josef...@gmail.com

Apr 16, 2015, 12:13:09 AM
to pystatsmodels
On Wed, Apr 15, 2015 at 11:55 PM, Jl24 <lamp...@gmail.com> wrote:
> Bump. I independently came to a similar conclusion. SM needs modularity if it is to grow faster.

Is there anything specific that you are thinking of or ran into?

Josef

lamp

Apr 16, 2015, 3:10:08 AM
to pystat...@googlegroups.com
Yes. The current statsmodels team is doing amazing, tireless work, but you guys can only do so much.

The advantage of R is in its numerous grassroots, dispersed packages, and the advantage of Python (SM + sklearn) is in its consistency, professionalism, and top-down organization.

The problem is that it is better to have some implementation of an estimator, perhaps with an inconsistent interface, than none at all. I often hear similar sentiments.

We need the best of both: a consistent core and a decentralized periphery that can be gradually incorporated inwards (the SM sandbox taken further).

I think the needed intervention has two facets:

1. What can be done to facilitate growth and contribution by the statsmodels core team?
           - sklearn gets funding; what sources of funding can SM tap into (crowdfunding, academia, industry)?
           - Incentivize maintenance
2. What can we do to facilitate contributions by the crowd?

This has two parts: lower the technical threshold, and increase the cognitive reward.

For the former, VincentAB raised some specific technical points. We should take this even further and brainstorm how we can define some sort of abstract interface or wrapper that can import or facilitate the development of peripheral packages. Julia's StatsBase can provide some inspiration: https://github.com/JuliaStats/StatsBase.jl

Next we come to increasing the cognitive rewards for contributing to this periphery and perhaps to SM core. Again I will echo and iterate on VincentAB's points. I've seen several abandoned python stats packages that languish in the corners of repos. There needs to be some sort of task view or collation that lists packages, and maybe a blog; both would start to form a more coherent sense of community outside of just pandas, SM, and sklearn. This will lead to increased knowledge, use, and a sense of contribution around these peripheral packages, and thus incentivize initial creation and continued maintenance.

Repos that are separate yet still tied into the community spread out the burden while being plugged in enough to provide a sense of reward to the creators/maintainers... from watching the R community, this is a critical piece of R's success imho.

Perhaps within statsmodels core there could be some sort of attribution or tag so that the primary PR writer can help with upkeep. 

Increasing reward/connectivity and lowering technical barriers for such controlled decentralization is critical if Python is to succeed as a statistics toolkit. I think the leadership and tireless efforts of the SM team combined with an improved utilization of the growing pydata community will be an amazing synergy.

Thoughts?

josef...@gmail.com

Apr 16, 2015, 11:58:21 AM
to pystatsmodels
Upfront: I just realized that I never replied at the end of this thread.
This was the beginning of my reply:

------------
Late response. I didn't want to send my first response, and then got
busy with the first week of class and needing to read up on GMM again.

I largely agree with the description of the problem. We need more
maintainers of subpackages and addons.
If supporting and advertising separate addons helps, I'm all for it,
especially in the long run.

This is essentially the same reason as the introduction of scikits to
support methods that cannot (yet) be added to scipy.

I don't think statsmodels will get an installer like R or Stata, since
there are the usual python channels and package distributors.
scipy-central might also become a good place after the current rewrite
and extensions.
But we would be able to help with search and advertising.
-----------------
and then I got distracted again. We had some follow-up threads that
were discussing related topics.


My main answers to this are: I don't know, and it needs a champion.

Building a new infrastructure for related packages requires work or a
template. I haven't seen anything from pandas or scikit-learn, which
had this discussion as well, but I also didn't search for any details.
This will only happen if someone comes up with a feasible
infrastructure and someone implements it.
(I maintained a "related packages" list in the statsmodels
documentation for a while early on in the life of statsmodels, but it
eventually got deleted because it was outdated and nobody kept it up
to date.)
If there is work in this area, then statsmodels can support it in
terms of adjustments to the infrastructure, documentation, and so on.

I'm pointing users on mailing lists or stackoverflow to other packages
that I know of and where statsmodels doesn't cover the functionality
yet, especially lmfit for nonlinear least squares, and to Kevin's
package for GARCH.
And there are various scikits on specific topics, e.g. bootstrap; a
bootstrap is also available in Kevin's GARCH package.

My personal view is still to try to integrate things into statsmodels.
One issue in the current python package distribution system is that
circular dependencies across packages cause problems in many
distribution channels. We had this for a while with pandas
(statsmodels is an optional dependency of pandas).
The trend in my refactoring of statsmodels in the last year has been
much more focused on tighter control and consistency of subpackages,
which, for the core models besides tsa (linear models, discrete
models, GLM, and RLM), goes in the opposite direction from separating
packages. This means that we can write new functionality (more
statistical tests, imputation, and similar) in a generic way that
works and can be immediately tested for several model categories.
The separation and modularization of the components of a model that
Vincent pointed out is independent of this, but refactoring in that
area is very slow.

Three examples:
Robust covariance matrices were added as a set of generic functions.
Models that provide scoreobs and hessian or something equivalent can
immediately take advantage of them. I integrated it in the `fit`
method as in Stata. R and Julia users have to look for the
corresponding "sandwich" packages, which are not integrated with the
models directly (although both R and Julia allow dispatch).
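
For example, a user just asks for it in `fit` (a small usage sketch
with made-up data):

import numpy as np
import statsmodels.api as sm

x = np.random.randn(100, 2)
X = sm.add_constant(x)
y = np.dot(X, [1.0, 2.0, -1.0]) + np.random.randn(100)

# heteroscedasticity-robust standard errors, requested directly in fit
res_hc = sm.OLS(y, X).fit(cov_type="HC3")
# HAC (Newey-West) standard errors for serially correlated errors
res_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(res_hc.bse)
print(res_hac.bse)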

Chad's new statespace models and Kalman filter can be reused for
regression models directly, without installing a separate statespace
package.

Kerby's GEE reused our GLM families and link functions, but we can
reuse the additional covariance structures that GEE brought with it
(although that hasn't happened yet). Kerby is also pushing other code,
like prediction and plot functions, into statsmodels that can be
reused across model categories instead of writing them for a specific
category of models in "his" packages.
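
A small usage sketch with made-up data, just to illustrate the reuse
of the family objects:

import numpy as np
import statsmodels.api as sm

n = 200
X = sm.add_constant(np.random.uniform(size=(n, 2)))
y = np.random.poisson(np.exp(np.dot(X, [0.5, 1.0, -1.0])))
groups = np.repeat(np.arange(50), 4)   # 50 clusters of 4 observations

# family is the same object GLM uses; the working correlation
# structure is one of the covariance structures that came in with GEE
model = sm.GEE(y, X, groups=groups,
               family=sm.families.Poisson(),
               cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())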


Those are my focus and priorities. They're not exclusive, but that's
where **my** time goes.
My overall vision is still the Stata pattern, with a core package that
can include and merge contributed packages. (Stata documentation often
has a comment at the end that the function is based on or was
originally written as a user package.)

I also think we should get more attribution of code to the authors in
statsmodels. We never discussed a policy for this and it is as
unsystematic as in scipy. One problem is that some older modules have
been worked on by many contributors, so it's difficult to assign
authorship, while other modules are still dominated by one author. The
second is that I don't think it will increase the involvement of the
original authors by a large amount. One typical pattern for
contributions is that they are written by a PhD student who has two
years available for statsmodels before hitting "real life", which
limits any future maintenance time. Better attribution can and should
still happen independently of this.

Finally, someone just needs to implement these ideas.

We are getting a good amount of new contributions, and bugfixes and
improvements to current code.
However, Skipper got a job that leaves him very little time for
statsmodels development and maintenance. Kevin contributed several
important infrastructure and continuous integration improvements,
among them the conversion to a common py2/py3 codebase (besides his
contributions to the model code).
But there is still a gap. Also, Skipper was the only or main
developer of the interface with pandas, the data handling, and the
interface with patsy's formulas.

I like to contribute to and design statsmodels, review code, do Q&A
and spend my time understanding the models. But I'm a mediocre
community organizer, and if I look at statsmodels and see mostly
maintenance and organizational or infrastructure tasks, then it is
difficult to stay motivated to spend most of my available time on it.

Help Wanted.


PS: I think statsmodels is great (in spite of the slowish pace, and
PRs that have to wait too long for a merge). And we are going in a
good direction, both in terms of adding new models and in terms of
adding all the bells and whistles to the existing models (prefix and
postestimation commands in Stata terminology).


Josef

josef...@gmail.com

Apr 16, 2015, 12:54:07 PM
to pystatsmodels
One thing I forgot to mention is about interfacing with statsmodels.

statsmodels doesn't have a well specified interface for many new models.

We have well established patterns for the existing categories of
models, although they still change and get enhancements, especially
for MLE, where we can subclass GenericLikelihoodModel or follow the
patterns of discrete_model.
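
For the MLE case the minimal pattern is roughly this (a Poisson model
purely as an illustration, the class name is made up):

import numpy as np
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel

class MyPoisson(GenericLikelihoodModel):
    # Only the negative log-likelihood per observation is needed;
    # numerical derivatives, fit, and the results machinery come from
    # the base class.
    def nloglikeobs(self, params):
        mu = np.exp(np.dot(self.exog, params))
        return -stats.poisson.logpmf(self.endog, mu)

# usage: res = MyPoisson(y, X).fit(start_params=np.zeros(X.shape[1]))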

But for many cases, statsmodels has neither the infrastructure nor
established patterns. In contrast to scikit-learn, we don't have a
fixed workflow or pipeline like fit-transform-predict on independent
or uncorrelated observations.

Part of the work in a PR can be in making necessary adjustments to
statsmodels, coming up with the design, or building the infrastructure
for new categories of models. This could still be model specific at
the beginning, but should be generalizable.
Examples: models with several exog arrays, which show up in mixed and
beta regression and will show up in many more; two- or multi-equation
models like Heckman sample selection; or panel data, where we still
have different patterns for accessing the panel units.

We won't get the interaction, feedback and future code reuse as much
with separate packages as we do with PRs and merging.

Josef.