Hi all,

My name is Timothy Brathwaite, and I'm a PhD student at UC Berkeley working in the area of discrete choice with Prof. Joan Walker. I don't have any experience developing software professionally or contributing to large open-source projects such as statsmodels. However, I have implemented a number of discrete choice models in Python, including the conditional logit model. I developed a fully functioning module that has worked very well for myself and others in my research group, and I just published it to PyPI today:

I haven't seen a conditional logit implementation in statsmodels yet, and in particular none that accounts for choice sets that vary across observations and that allows coefficients to be constrained across a subset of alternatives (in addition to the usual options of a single coefficient across all alternatives or a different coefficient for each alternative).

I think the right place for such a package is really within statsmodels rather than a standalone package, and I would like to contribute if possible. Would anyone here be willing to take a look at my project (there are examples on the GitHub page) and provide some direction on how best to go about contributing?

Thanks for reading,
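(For concreteness, a rough numpy sketch of the varying-choice-set idea described above: unavailable alternatives are simply masked out of the logit denominator. The numbers are made up and this is not pylogit's actual implementation, just an illustration of the idea.)

# Illustrative sketch only -- not pylogit's internals.
# Conditional logit probabilities when the choice set differs across observations:
# unavailable alternatives are removed from the denominator via an availability mask.
import numpy as np

# 3 observations, 4 alternatives; V[i, j] is the systematic utility of alt j for obs i.
V = np.array([[1.0, 0.2, -0.5, 0.0],
              [0.3, 0.3,  0.8, 0.1],
              [0.0, 1.5,  0.4, 0.9]])

# 1 if alternative j is in observation i's choice set, else 0
avail = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 0],
                  [0, 1, 1, 1]])

exp_V = np.exp(V) * avail                         # zero out unavailable alternatives
probs = exp_V / exp_V.sum(axis=1, keepdims=True)
print(probs.round(3))                             # each row sums to 1 over the available alternatives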
On Mon, Mar 14, 2016 at 10:30 PM, Timothy Brathwaite <timoth...@gmail.com> wrote:

Hi Timothy,

Thanks, it's great to see someone else working in this area. With some effort it should be possible to get this area well covered in statsmodels.

Did you compare your implementation with the ConditionalLogit in the former GSOC PR?

I only had time for some partial skimming of your package, so I'll leave most comments until after I have looked at it a bit more carefully.

In terms of implementation, the main difference is that we separate the results out into Results classes and don't store the estimation results as attributes of the model instance. Also, most results in statsmodels are calculated lazily on demand, while AFAICS you calculate all results immediately.

Overall, AFAICS, you are not subclassing any of our models or results, which requires code duplication from our perspective. However, there might be some parts where statsmodels still needs refactoring, and parts of your code might be closer to a future statsmodels than to the current one. (Just a vague impression right now, especially regarding the usage of scipy minimize.)
The notebooks look good, but I didn't have time to check the details yet.
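(To illustrate the model/results split and the lazy computation Josef describes: in statsmodels a custom likelihood model is typically written by subclassing GenericLikelihoodModel, and fit() returns a separate results object whose statistics are computed on demand. The toy binary logit below is only a stand-in to show the pattern; it is not pylogit's conditional logit or a proposed statsmodels design.)

# Sketch of the statsmodels pattern: the model class only computes the likelihood,
# fit() returns a separate results object, and quantities such as bse are evaluated
# lazily when accessed rather than stored on the model instance.
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel

class ToyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        # per-observation negative log-likelihood of a binary logit
        xb = self.exog @ params
        return np.log1p(np.exp(xb)) - self.endog * xb

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 1])))).astype(float)

res = ToyLogit(y, X).fit(start_params=np.zeros(2), disp=0)  # a results instance, not the model
print(res.params)
print(res.bse)  # computed on demand from the numerically estimated Hessian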
Hey Josef, thanks for taking a look!

You're right: I do calculate all results immediately, and I'm not yet subclassing any of your models or results. This was just to reduce the number of dependencies needed. I haven't actually looked under the hood at your results or model classes yet, beyond what was needed to use the statsmodels summary tables. I'm happy to look at whether I could easily subclass your existing models or results to reduce code duplication as much as possible.

In terms of the choice-specific variables, filling the design matrix with zeros was done specifically to make the computation the same regardless of whether a variable shows up in the utility equation for a given alternative. As for identification, as long as there is variation across alternatives, the parameter should be identified; I think that variation is sufficient and that an alternative-specific intercept isn't necessary. My usual go-to reference for identification issues is Ken Train's Discrete Choice Methods with Simulation, Section 2.5 (Chapter 2, "Properties of Discrete Choice Models"), though he doesn't discuss choice-specific variables per se.
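(A tiny made-up example of the zero-filling described above, not pylogit's internal representation: a variable that only enters the utilities of some alternatives gets zeros in the other alternatives' rows of the long-format design matrix, so the utilities are always just X @ beta regardless of the specification.)

# Long-format design matrix for one observation with alternatives 1, 2, 3.
# "cost" enters the utilities of alternatives 1 and 2 only, so its column is
# zero in alternative 3's row and the same X @ beta works for every specification.
import numpy as np

#             ASC_1  ASC_2  cost (alts 1 & 2 only)
X = np.array([[1.0,   0.0,   2.5],    # alternative 1
              [0.0,   1.0,   4.0],    # alternative 2
              [0.0,   0.0,   0.0]])   # alternative 3: cost not in its utility
beta = np.array([0.5, 0.2, -0.1])

V = X @ beta                          # systematic utilities, one formula for all rows
P = np.exp(V) / np.exp(V).sum()       # conditional logit probabilities for this observation
print(P.round(3))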
I literally just found out about the numpy docstring standard when trying to register pylogit on PyPI. I wish I had known about it sooner! I'll get to reformatting my docstrings next week (I have a wedding to go to this weekend).

As far as extensions go, I have four other models already coded and working with pylogit on my own computer. Two of them are related to known models: they are multinomial generalizations of the clog-log model and of the scobit model ("Scobit: An Alternative Estimator to Probit and Logit"). The other two are new generalizations of the conditional logit model that I developed in my research. I'm currently writing an article on these four models, so I decided to hold off on including them in the first pylogit release.

In terms of differences from the standard logit model, here's a picture of what they look like in the binary case. They are all asymmetric choice models: the shape of their probability curve is asymmetric, and they often have extra parameters to be estimated that determine the precise shape of the curve. In this sense they are very similar to the "Generalized Logistic Model" described by Stukel.
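(For readers who haven't seen these link functions, the standard binary-case formulas below illustrate the asymmetry being discussed; they are the textbook logit, clog-log, and scobit curves, not the multinomial generalizations from Timothy's paper. The logit curve is symmetric around P = 0.5, while clog-log and scobit are not, and scobit's extra shape parameter alpha controls where the curve is steepest.)

# Binary-case probability curves: logit (symmetric), clog-log and scobit (asymmetric).
# Standard textbook forms, shown only to illustrate the shape differences mentioned above.
import numpy as np

def logit(v):
    return 1.0 / (1.0 + np.exp(-v))            # symmetric around P = 0.5

def cloglog(v):
    return 1.0 - np.exp(-np.exp(v))            # asymmetric: approaches 1 faster than 0

def scobit(v, alpha=0.5):
    return 1.0 - (1.0 + np.exp(v)) ** -alpha   # extra shape parameter alpha (Nagler 1994)

v = np.linspace(-4, 4, 9)
for name, p in [("logit", logit(v)), ("clog-log", cloglog(v)), ("scobit", scobit(v))]:
    print(f"{name:>9}: {np.round(p, 3)}")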