Converting a big Stata program to Pandas?


John Eiler

Jul 13, 2014, 12:36:46 PM
to pyd...@googlegroups.com
I've been handed a fairly large Stata program (around 5,000 lines of code spread across several do files) and asked to improve its readability, maintainability, and speed (it takes about an hour to run). I haven't dug deeply into the code yet, but I was hoping someone here might be able to give some general tips. I'm decent with Stata and can do some basic things in Pandas (with a lot of help from the documentation).

The main question I have at this point is:  what sort of speed differences can I expect between standard tasks in Stata and Pandas?  E.g. merges, sorts, group by, creating new variables from functions of old ones, etc.

I thought I would be able to find some Stata vs. Pandas benchmarks to get a sense of this, but couldn't find anything via Google. I do know from past experience that Python/Numba is much faster than Stata for basic tasks like generating new variables and summing over the data.

Anyway, my current plan is to take a section of Stata code, convert it to Pandas, and see what sort of speed difference I get. Since Pandas can read and write *.dta files, it should be pretty straightforward to selectively replace pieces as a first step.

It's probably early to ask this, but would buying something like the MKL Optimizations package from Continuum speed things up? I understand it would speed up NumPy, so would that indirectly speed up Pandas as well?

Anyway, thanks in advance for any comments, suggestions, or warnings!

Skipper Seabold

Jul 13, 2014, 1:06:49 PM
to pyd...@googlegroups.com
On Sun, Jul 13, 2014 at 12:36 PM, John Eiler <eil...@gmail.com> wrote:
I've been handed a fairly large Stata program (around 5,000 lines of code spread across several do files) and asked to improve its readability, maintainability, and speed (it takes about an hour to run). I haven't dug deeply into the code yet, but I was hoping someone here might be able to give some general tips. I'm decent with Stata and can do some basic things in Pandas (with a lot of help from the documentation).

The main question I have at this point is:  what sort of speed differences can I expect between standard tasks in Stata and Pandas?  E.g. merges, sorts, group by, creating new variables from functions of old ones, etc.

Big speed-ups aren't out of the question; I've definitely seen some. The biggest advantage in my experience is memory management.

It's also much easier to express what you're trying to do with pandas-fu vs. Stata, and to do it in one pass rather than several. Groupby is especially helpful. Your 5,000-line do file might end up being a few hundred lines of Python, depending.
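For example, a common Stata pattern like "egen xbar = mean(x), by(g)" followed by "gen dev = x - xbar" collapses into a single groupby/transform step. A rough sketch with made-up column names:

import numpy as np
import pandas as pd

# Hypothetical frame: a group column `g` and a value column `x`.
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1.0, 3.0, 2.0, 6.0]})

# One pass: broadcast each group's mean back to its rows, then take the deviation.
df["dev"] = df["x"] - df.groupby("g")["x"].transform(np.mean)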
 

I thought I would be able to find some Stata vs. Pandas benchmarks to get a sense of this, but couldn't find anything via Google. I do know from past experience that Python/Numba is much faster than Stata for basic tasks like generating new variables and summing over the data.

Anyway, my current plan is to take a section of Stata code, convert it to Pandas, and see what sort of speed difference I get. Since Pandas can read and write *.dta files, it should be pretty straightforward to selectively replace pieces as a first step.

Probably a good plan. If you get stuck translating anything, or wonder whether some code follows best practice, feel free to post to the list. I'm curious to hear what you find.
 

It's probably early to ask this, but would buying something like the MKL Optimizations package from Continuum speed things up? I understand it would speed up NumPy, so would that indirectly speed up Pandas as well?

Well, that depends on what your program is doing, I guess. If it's linear-algebra-heavy (i.e., has Mata code), you might benefit from a good underlying linear algebra library. I wouldn't worry about it much at first. Continuum has a 30-day free trial for its MKL libraries, AFAIK.
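If you want to see what NumPy is currently linked against before spending any money, this prints the build configuration (the exact output varies by version and install):

import numpy as np

# Shows the BLAS/LAPACK libraries NumPy was built against
# (e.g. MKL, ATLAS, OpenBLAS, or the reference implementation).
np.show_config()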
 

Anyway, thanks in advance for any comments, suggestions, or warnings!


Phillip Cloud

Jul 13, 2014, 1:13:26 PM
to pyd...@googlegroups.com
The last thing you should be worried about is speed. With a port of this size you should start with tests that compare the output of the Stata program with the output of pandas. Ideally you'd have this for every processing step, so that you can verify you haven't broken anything along the way. Only when your output is correct should you worry about whether things are fast enough. There are many ways to speed up your program, but correctness should be your main concern at this point. I promise you'll thank your future self for writing a bunch of tests, so you don't have to say a prayer every time the program runs.

I would start by creating some kind of wrapper that lets you call Stata from within Python, then writing a series of tests that cover as much of the functionality as you want to preserve. Then start porting, and every time you make a change, run the tests. If something breaks, you'll know before writing a bunch more code that introduces even more bugs.
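Something along these lines, say -- a minimal sketch that assumes a `stata` binary on the PATH, that each step's do file writes its result to a known .dta file, and that you already have a pandas version of that step (the file names here are made up, and `assert_frame_equal` lives in `pandas.testing` in newer versions):

import subprocess

import pandas as pd
from pandas.util.testing import assert_frame_equal


def run_stata(do_file):
    # Run a do file in batch mode; check_call raises if Stata exits non-zero.
    subprocess.check_call(["stata", "-b", "do", do_file])


def check_step(do_file, stata_output, pandas_result):
    # Re-run the original Stata step, load the dataset it writes, and compare
    # it against the pandas translation of the same step.
    run_stata(do_file)
    expected = pd.read_stata(stata_output)
    # Row order and dtypes often differ slightly between the two systems,
    # so align columns and relax the dtype check before comparing values.
    assert_frame_equal(expected, pandas_result[expected.columns],
                       check_dtype=False)

Then every translated step gets one of these checks, and the whole suite runs after each change.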


--
Best,
Phillip Cloud


John E

Jul 13, 2014, 9:11:44 PM
to pyd...@googlegroups.com
Charles, completely agree that correct >> fast. At the same time, the main reason for doing a conversion is to make it faster. If Pandas isn't faster, it will probably have to be done in either Fortran or NumPy, but that would take longer to program, of course.


John E

Jul 13, 2014, 9:16:32 PM
to pyd...@googlegroups.com
Probably a good plan. If you get stuck translating anything, or wonder whether some code follows best practice, feel free to post to the list. I'm curious to hear what you find.
 
Thanks Skipper, I'd be happy to post further here on specifics when I get deeper into the translation.

Well, that depends on what your program is doing, I guess. If it's linear-algebra-heavy (i.e., has Mata code), you might benefit from a good underlying linear algebra library. I wouldn't worry about it much at first. Continuum has a 30-day free trial for its MKL libraries, AFAIK.

There are some matrices and equations, but I'm not sure yet how much of the code that is. I think basic data management will be the bulk of the compute time, though.

Phillip Cloud

Jul 13, 2014, 9:20:14 PM
to pyd...@googlegroups.com
In that case there aren't any Python libraries for data analysis that are faster and more comprehensive than pandas. One thing to watch out for is applying a plain Python function to a groupby with a large number of groups (> 100k). In any case, we're here to help so don't hesitate to ask. 

Paul Hobson

Jul 14, 2014, 5:47:49 PM
to pyd...@googlegroups.com
On Sun, Jul 13, 2014 at 6:20 PM, Phillip Cloud <cpc...@gmail.com> wrote:
In that case there aren't any Python libraries for data analysis that are faster and more comprehensive than pandas. One thing to watch out for is applying a plain Python function to a groupby with a large number of groups (> 100k). In any case, we're here to help so don't hesitate to ask. 


Just curious: what's the preferred alternative here? A numpy ufunc?
-p

Phillip Cloud

Jul 14, 2014, 5:50:29 PM
to pyd...@googlegroups.com
Anything that has been Cython-ized (taking advantage of static typing) works well. You may have to play around with boxing NumPy arrays into Series in another function that calls your Cython function. Ufuncs might work; I've never tried them.
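To put the earlier caution in concrete terms, here's a rough illustration with made-up data; the built-in aggregation stays in Cython the whole way, while the lambda is called once per group in Python:

import numpy as np
import pandas as pd

# Hypothetical data: ~1 million rows spread over ~200k groups.
n = 10 ** 6
df = pd.DataFrame({"key": np.random.randint(0, 200000, n),
                   "val": np.random.randn(n)})

grouped = df.groupby("key")["val"]

# Slow path: a plain Python function runs once per group.
slow = grouped.apply(lambda s: s.sum())

# Fast path: the built-in, Cython-ized aggregation.
fast = grouped.sum()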


--
Best,
Phillip Cloud



John E

Jul 16, 2014, 5:51:23 PM
to pyd...@googlegroups.com
For anyone who is interested, reading Stata datasets directly into pandas looks like a no-go. It works fine, just too slowly, even on pretty small files.

E.g. Pandas reads in a 100m CSV in about 15 seconds, but the corresponding DTA file takes around 90 seconds. HDF on the same data is very fast, of course. This is very simple data, btw, just integers and reals. The DTA is from Stata version 11.2.

The workaround is not bad at all: I'm just going to end up passing CSVs from Stata to pandas and then saving as HDF5. The loss of speed will not be much of a problem; I was just hoping to avoid the extra bookkeeping of having three versions of the datasets.
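The pandas side of that pipeline is only a couple of calls; a sketch with made-up file and key names:

import pandas as pd

# Stata side, for reference (Stata 11):  outsheet using "step1.csv", comma replace

# Read the CSV that Stata wrote, then cache it as HDF5 so later runs
# skip the slower text parse.
df = pd.read_csv("step1.csv")
df.to_hdf("step1.h5", "step1", mode="w")

# Subsequent runs load the binary copy instead.
df = pd.read_hdf("step1.h5", "step1")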

John E

Jul 18, 2014, 5:07:30 PM
to pyd...@googlegroups.com
Speed question:  I'm just trying to find a simple and fast way to create a variable based on a NumPy array (basically a table lookup).

arr = np.array([ 900, 700, 1000 ]) 

df.yy.apply(lambda x: arr[x])

That does what I want, although it is much slower (20x?) than the corresponding Stata code. The following also works and is about 5x faster (thanks, Numba!) than the apply/lambda, but it's getting to be a lot of typing for something simple.

from numba import jit

@jit
def fpl_numba(x):
    return arr[x]

df.yy.apply(fpl_numba)

Any hints on a better and faster way to do this? Sorry if this is a dumb question, but that's the best I have managed so far. Here's the Stata code, btw:

generate fpl = arr[yy]







Phillip Cloud

Jul 18, 2014, 5:11:41 PM
to pyd...@googlegroups.com
You can use Series.map:

In [10]: yy = Series(np.random.choice([0,1,2],size=100))

In [11]: arr
Out[11]: array([ 900,  700, 1000])

In [12]: yy.map(Series(arr))
Out[12]:
0     900
1    1000
2     700
3    1000
4     700
...
94    1000
95     700
96     900
97     900
98     700
99     900
Length: 100, dtype: int64


--
Best,
Phillip Cloud


John E

Jul 18, 2014, 5:24:04 PM
to pyd...@googlegroups.com
That's much better in all ways.  Thanks Charles!

John E

Jul 24, 2014, 11:53:22 PM
to pyd...@googlegroups.com
Instead of 

yy.map(Series(arr))

will this also work?

Series(arr[yy])

Mainly I ask because I want to generalize to arr being an n-dimensional array and can't figure out how to do it with the former method.

I was able to get the following to work as desired with a two-dimensional array:

Series(ar[ df.x, df.y ])

With the simpler example as posted above, though, I'm getting an error about an "unsupported iterator index". As far as I can tell, all the types and dtypes are the same, the indices are in bounds, etc. I can't figure out why one works and the other doesn't, and I hope I didn't just get the one example to work by blind luck or something!




John E

Jul 25, 2014, 10:10:54 AM
to pyd...@googlegroups.com
This works, for example:

In [412]: yy = Series(np.random.choice([0,1,2],size=10))

In [413]: arr2 = np.array( [[10,20,30],[40,50,60],[70,80,90]] )

In [414]: Series( arr2[ yy, yy ])

Out[414]:

0    90
1    50
2    90
3    50
4    90
5    50
6    90
7    90
8    10
9    10


A very simple rewrite of this example to be one-dimensional breaks it, though (pasted below in case it matters). Strange that it seems to work in two dimensions but not one. With some searching I found a reference to a NumPy bug that was apparently fixed in version 1.9 (I'm using the latest Anaconda, with NumPy 1.8.1), though I'm not sure it's the same issue.


In [418]: arr3 = arr2[yy,0]

In [419]: Series( arr3[ yy ])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-419-cb07f86c8fd7> in <module>()
----> 1 Series( arr3[ yy ])

IndexError: unsupported iterator index
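If the Series-as-index path is the culprit, indexing with the underlying ndarray instead of the Series may sidestep it (an untested guess, based on the fix apparently being in NumPy 1.9):

Series(arr3[yy.values])  # hand NumPy a plain ndarray rather than the Series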

John E

Jul 25, 2014, 10:42:16 AM
to pyd...@googlegroups.com
Not sure which questions belong here vs. Stack Overflow, so I just posted there, btw: