I've been handed a fairly large Stata program (around 5,000 lines of code spread across several do files) and asked to improve its readability, maintainability, and speed (it takes about an hour to run). I haven't dived deeply into the code yet, but I was hoping someone here could offer some general tips. I'm decent with Stata and can do some basic things in pandas (with a lot of help from the documentation).

The main question I have at this point: what sort of speed differences can I expect between standard tasks in Stata and pandas? E.g. merges, sorts, group-bys, creating new variables as functions of old ones, etc.
I thought I would be able to find some Stata vs. pandas benchmarks to get a sense of this, but couldn't find anything via Google. I do know from past experience that Python/Numba is much faster than Stata for basic tasks like generating new variables and summing over the data.

Anyway, my current plan is to take a section of the Stata code, convert it to pandas, and see what sort of speed difference I get. Since pandas can read and write .dta files, it should be pretty straightforward to selectively replace pieces as a first step.
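To make that hand-off concrete, the round trip I have in mind looks roughly like this (the file names are made up; I'd substitute whatever the do files actually use, and only port one section at a time):

    import pandas as pd

    # Load the output of an upstream do file.
    df = pd.read_stata("step3_input.dta")

    # ... redo one section's worth of Stata logic in pandas here ...

    # Write the result back so the downstream do files run unchanged.
    df.to_stata("step3_output.dta", write_index=False)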
It's probably early to ask this, but would buying something like the MKL Optimizations package from Continuum speed things up? I understand it speeds up NumPy, so would that indirectly speed up pandas as well?
Anyway, thanks in advance for any comments, suggestions, or warnings!
Probably a good plan. If you get stuck translating anything, or you wonder whether some code follows best practice, feel free to post to the list. I'm curious to hear what you find.
Well, that depends on what your program is doing, I guess. If it's linear-algebra-heavy (i.e., has Mata code), you might benefit from a good underlying linear algebra library. I wouldn't worry about it much at first. Continuum has a 30-day free trial for its MKL libraries, AFAIK.
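Before buying anything, it's worth checking what your NumPy build is already linked against:

    import numpy as np

    # Prints the BLAS/LAPACK libraries this NumPy build was compiled against.
    # If MKL already shows up here, the paid package won't add anything.
    np.__config__.show()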
In that case, there aren't any Python libraries for data analysis that are faster and more comprehensive than pandas. One thing to watch out for is applying a plain Python function to a groupby with a large number of groups (> 100k). In any case, we're here to help, so don't hesitate to ask.
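To illustrate the groupby point, here's a toy comparison (sizes and column names are made up). The built-in aggregations run in Cython regardless of group count, while apply() with a plain Python function calls back into the interpreter once per group:

    import numpy as np
    import pandas as pd

    n = 1000000
    df = pd.DataFrame({"key": np.random.randint(0, 200000, size=n),
                       "val": np.random.randn(n)})
    g = df.groupby("key")["val"]

    # Cython path: two fast aggregations, then one vectorized subtraction.
    fast = g.max() - g.min()

    # Python path: the lambda runs once per group, so ~200k interpreter calls.
    slow = g.apply(lambda s: s.max() - s.min())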
Series(ar[df.x, df.y])
In this simpler example as posted, though, I'm getting an error about an "unsupported iterator index". As far as I can tell, all the types and dtypes are the same, the indices are in bounds, etc. I can't figure out why one works and the other doesn't, and I hope I didn't just get the one example to work by blind luck or something!
A very simple rewrite of this example to be 1-dimensional breaks it, though (error pasted below, if it matters). It's strange that it seems to work in 2 dimensions but not 1. With some searching I found a reference to a NumPy bug that was apparently fixed in version 1.9 (I'm using the latest Anaconda, with NumPy 1.8.1), though I'm not sure it's the same issue.
IndexError: unsupported iterator index
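If it is that NumPy issue, here's my best guess at a minimal 1-D reproduction (array contents made up), along with a workaround that indexes with the underlying ndarray instead of the Series:

    import numpy as np
    import pandas as pd

    ar = np.arange(10.0)
    idx = pd.Series([2, 5, 7])

    # On NumPy 1.8.x this can raise "IndexError: unsupported iterator index":
    # s = pd.Series(ar[idx])

    # Indexing with the plain ndarray sidesteps the Series iterator path:
    s = pd.Series(ar[idx.values])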