Re: Tutorial on how to write Cython code to read/return DataFrame?


Richard Stanton

Jun 10, 2013, 2:55:37 PM
to pyd...@googlegroups.com, jeffr...@gmail.com
Jeff:

Thanks very much for your suggestions a while ago on getting cython to
work with dataframes. I can now access and play with the elements of a
simple data frame very nicely (and much faster than in my original highly
serial python code).

Just a couple of things I wasn't sure about in your original message(s) -
I'd be very grateful if you could elaborate a little.

1) "another issue is you have to be very careful about dtypes (they need
to be passed as separate arrays)"

Not quite sure what this means! Right now, things work very nicely with
hard-coded, single data-types, but I'd like to be able to pass data where
some columns are doubles, some ints, etc. It sounds like your comment may
relate to this, but I don't quite follow.

2) "in pandas we use cython to operate and return numpy arrays. Then u can
use the normal constructors"

Again, I'm afraid I don't follow. Maybe you could point me to where in the
pandas source some of this stuff is done, as I'm sure it would be very
valuable to see some examples?

Best,

Richard Stanton



Jeff

Jun 10, 2013, 3:08:24 PM
to pyd...@googlegroups.com, jeffr...@gmail.com
sure

1) you can only pass a single dtype object to cython, so say you have a dataframe with ints/floats, you will need to pass these as 2 separate arguments
you can get these via df.blocks (which returns a dict of dtype -> a DataFrame)

so for simplicity you want to either have multiple functions which operate on a single dtype (and do the same thing)

e.g.

sum_float64
sum_int64
.....

or you bake it into your function... e.g.

def myfunction(ndarray[float64_t] floats, ndarray[int64_t] ints): ...

etc
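
To make point 1 concrete, here is a rough sketch (not from the original message) of splitting a mixed frame by dtype and dispatching to the per-dtype Cython functions named above; the exact signatures of sum_float64/sum_int64 are assumptions, and df.blocks is taken to return a dict of dtype name -> single-dtype DataFrame as described:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5, dtype='int64'),
                   'b': np.random.randn(5)})

results = {}
for dtype, sub in df.blocks.items():     # {dtype name -> single-dtype DataFrame}
    values = sub.values                  # one homogeneous ndarray per dtype
    if dtype == 'float64':
        results[dtype] = sum_float64(values)   # hypothetical Cython routine
    elif dtype == 'int64':
        results[dtype] = sum_int64(values)     # hypothetical Cython routine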

2) I mean exactly this, we don't pass DataFrames at all to the cython functions, rather df.values (or in some
cases df.blocks); you can, but it's rarely necessary

so do something like this:

df                                       # an existing DataFrame

v = mycool_function_in_cython(df.values)
new_df = DataFrame(v)
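
To make that concrete, here is a minimal sketch of what a Cython function like mycool_function_in_cython could look like; the body (doubling every element) is purely illustrative, not actual pandas code, and it assumes a 2-D float64 array such as df.values for an all-float frame:

# mycool.pyx -- illustrative sketch only
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def mycool_function_in_cython(np.ndarray[np.float64_t, ndim=2] values):
    cdef Py_ssize_t i, j
    cdef np.ndarray[np.float64_t, ndim=2] out = np.empty_like(values)
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            out[i, j] = values[i, j] * 2.0   # placeholder computation
    return out

Passing index=df.index and columns=df.columns to the DataFrame constructor afterwards keeps the original labels on the result.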

Richard Stanton

Jun 11, 2013, 4:57:08 PM
to pyd...@googlegroups.com
Thanks again, Jeff. It's now starting to fall into place. Just one more
question:

"you can only pass a single dtype object to cython, so say you have a
dataframe with ints/floats, you will need to pass this in 2 separate
arguments. you can get these via df.blocks (which returns a dict of dtype
-> a DataFrame)"

When I have a single-type dataframe, then I can pass df.values, have my
cython code modify the contents, and the modifications will show up in the
original data frame, df, when I exit. Unless I'm missing something,
df.blocks seems to allow me read but not write access to the original
dataframe. What's a good way to modify a mixed-type dataframe in-place
with Cython?


Jeff Reback

Jun 11, 2013, 5:10:00 PM
to pyd...@googlegroups.com
you need to delve into the internals a bit

create a mixed type df

for block in df._data.blocks:
    # if you modify this it will be reflected in
    # the frame; this is a numpy array
    block.values

block has many more useful properties

eg dtype
ref_items which are basically the columns in that block

the block manager is df._data

the axes of the frame are in .axes

you can modify blocks if you want

in general it's a better idea to return a new numpy array (and then create a new block)

see core/internals.py
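
A minimal sketch of the pattern described above, assuming the pandas internals of this era (df._data and its blocks are private and can change between versions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'i': np.arange(3, dtype='int64'),
                   'f': np.random.randn(3)})

for block in df._data.blocks:       # df._data is the BlockManager
    print(block.dtype)              # each block holds a single dtype
    if block.dtype == np.float64:
        block.values[:] = 0.0       # in-place edit is reflected in df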

Richard Stanton

Jun 11, 2013, 6:29:15 PM
to pyd...@googlegroups.com
Excellent! That's exactly what I was after. Thanks, Jeff. 

I'll also take a look at core/internals.py. More generally, is there any kind of road map to the pandas source code? Browsing through it would be a great way to learn how to do stuff like this, but there's a lot of source and it's not at all obvious (to me, at least) which file to look at.

Jeff Reback

Jun 11, 2013, 6:46:25 PM
to pyd...@googlegroups.com
I usually just step through a statement in the debugger from the top level

no roadmap per se

but if you have questions I'll try to help you out

out of curiosity what is your end goal here?


Richard Stanton

Jun 11, 2013, 6:59:43 PM
to pyd...@googlegroups.com
Thanks again, Jeff.

My goal is simple: speed. I have a (working) C program that does some statistical estimation of option-exercise rates. Running it takes about 10-15 hours of CPU time, but it's spread across 10 cores on my machine, giving me a quite manageable hour and a half or so. Because managing the data in C is a pain, however, I'm rewriting the code in Python. Calculating the objective function in the minimization is a *lot* slower there, mainly because, due to the size of the datasets, I have to merge subsets of the data each time I calculate the function for a new set of parameters, and many of the calculations within the merge end up being serial in nature. The result is that the code takes way longer than its C equivalent (I haven't done a formal comparison, but I'd estimate a factor at least in the hundreds, and quite possibly in the thousands), so I desperately need to speed it up. Cython is one part of the solution, and parallelization will be the other. I don't mind if it takes a few hours instead of an hour and a half, but a few weeks or a few months is a bit much...


Jeff Reback

Jun 11, 2013, 7:27:10 PM
to pyd...@googlegroups.com
sounds cool

you know that you can call C functions directly from Cython with the raw numpy data (google it, as it's a bit non-trivial)
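
For anyone following along, a minimal sketch of that pattern, where my_c_routine and mycode.h are hypothetical stand-ins for your own C code:

# call_c.pyx -- illustrative sketch only
import numpy as np
cimport numpy as np

cdef extern from "mycode.h":
    void my_c_routine(double *data, long n)

def call_c(np.ndarray[np.float64_t, ndim=1] arr):
    # arr must be C-contiguous; use np.ascontiguousarray(arr) first if unsure
    my_c_routine(&arr[0], arr.shape[0])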



Richard Stanton

Jun 12, 2013, 12:06:39 PM
to pyd...@googlegroups.com
Thanks for the suggestion. My hope is to be able to just use Cythonized python code as much as possible, as I'm finding coding in Python so much more pleasant than C, but this is useful to know.

Richard Stanton

Jun 13, 2013, 6:40:47 PM
to pyd...@googlegroups.com
One thing that had me scratching my head for a while: When you use this approach, it seems you need to reverse the usual order of the indices in order to access the elements.

Jeff Reback

Jun 13, 2013, 6:52:56 PM
to pyd...@googlegroups.com
that's correct
look at axes
it's stored columns then index
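
A quick way to see the reversed order (again poking at the private block internals, so version-dependent):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['a', 'b'])
block = df._data.blocks[0]
print(df.shape)              # (3, 2): (rows, columns)
print(block.values.shape)    # (2, 3): (columns, rows), i.e. transposed
print(block.values[1, 0])    # column 'b', row 0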




Richard Stanton

Jun 14, 2013, 1:42:44 PM
to pyd...@googlegroups.com
Hi Jeff:

Merely replacing one 6-line segment of code with a cython equivalent has reduced my program's execution time by about 70%, so this quest is definitely worthwhile! There is a bit more to do along these lines, but now I have a new issue. Using the profiler, I realize that the majority of execution time is now spent executing the pandas "merge" command (to calculate the objective function for the entire dataset, I need to run this command over a million times to merge lots of chunks of data). I need to reduce this time substantially. I strongly suspect that a lot of the time now is being spent allocating memory, so I'd like to stop doing this.

Since I use the merged chunks of data sequentially, in my C code I allocate up-front one buffer large enough to hold the largest conceivable merged output then loop through, merging, putting the results into this pre-allocated memory, processing, then moving on to the
next chunk. I'd like to replicate this in my Python code, so ideally I'd start by allocating 3 large ndarrays for strings, ints, and floats respectively, then loop through my data, merging and using these ndarrays to store the results. In principle this doesn't seem too hard, but I do have a few questions:

1) Is there any way to get pandas' existing merge command to use pre-allocated memory for its output? I assume I'm going to have to code this myself, but it can't hurt to ask...

2) Once I have my output in these ndarrays, how do I turn them into a dataframe? There are (at least) two issues I can see here:

   a. I'm using three blocks of different types, not just a single homogeneous ndarray - I'm not sure how to turn heterogeneous ndarrays into a single dataframe.

   b. Because the blocks are preallocated to fit the largest possible output, when I actually perform a merge and put the output into these blocks, I'll only want to use a subset of the blocks in my dataframe.

Thanks again for your help here. Hopefully this discussion will be of some value to others interested in similarly speeding up their pandas code, but I don't want to hog too much bandwidth, so do let me know if there's a more appropriate forum for this discussion.

Best,

Richard

Jeff

Jun 14, 2013, 2:30:26 PM
to pyd...@googlegroups.com
Can you show the section of code that does the merge?

2b) is easy, just pre-allocate and slice off what you need, e.g.

df = <a big pre-allocated frame>
df.iloc[0:last_index_i_used]

I think this will even be done with a view, so it should be very fast.

I think you should just keep floats/ints/strings in separate frames entirely; your code will be simpler and you won't have to deal with blocks directly... my 2c

Merging ndarrays of different types can be done, for example, but you would have to create the blocks, add them to the block manager and then create the frame. It depends how much you want to delve into the internals.
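
A rough sketch of the pre-allocate-and-slice idea, keeping dtypes in separate frames as suggested; MAX_ROWS and n_used are hypothetical (n_used would be however many rows one merge actually produced):

import numpy as np
import pandas as pd

MAX_ROWS = 1000000   # big enough for the largest conceivable merge output

float_buf = pd.DataFrame(np.empty((MAX_ROWS, 2)), columns=['price', 'ret'])
int_buf = pd.DataFrame(np.empty((MAX_ROWS, 1), dtype='int64'), columns=['qty'])

# ... fill the first n_used rows of each buffer during a merge, then ...
n_used = 42                              # hypothetical row count
floats = float_buf.iloc[0:n_used]        # slice off just what was filled
ints = int_buf.iloc[0:n_used]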

Richard Stanton

Jun 14, 2013, 3:25:27 PM
to pyd...@googlegroups.com
Sure. First, to describe the program:

There are two sets of data. One (daily) with stock price and related information. The other contains one observation each time an employee exercises an option (plus other events). I care about all the days when nothing happened as well as the days when options were exercised, so I want to merge the two datasets together to create one huge dataset listing for each option held by each individual on each day how many were exercised (plus assorted explanatory variables). The problem is that I don't have enough memory to store the entire merged dataset in RAM at once (it would have about half a billion observations and lots of variables, and I might end up getting more data at some point...), and storing anything to disk would make this impossibly slow, so what I do is loop through, merging all the data for one employee at a time, processing, and moving on to the next employee. 

The relevant section of code looks like this:

def mergeData(exDataUnique):
    '''Merge exercise data with (more frequently observed) stock price data for a given grant/person.

    Input: exercise data for one grant/person combination.
    Output: same, merged with appropriate stock price data.
    '''

    # the exercise data are assumed to be sorted by date
    firmNo = exDataUnique['firmNo2'].iloc[0]
    firstDate = exDataUnique.date.iloc[0]
    lastDate = exDataUnique.date.iloc[-1]

    # Keep stock price data only for the date range we are interested in
    stockData = stock[(stock.firmNo == firmNo) &
                      (stock.date >= firstDate) &
                      (stock.date <= lastDate)]

    # Merge the stock data with the exercise data on date
    merged = stockData.merge(exDataUnique.drop(['firmNo', 'firm', 'nprc1'], axis=1),
                             how='left', on='date')

    # Now interpolate the series to fill in the blanks

    [...]

grouped = exClean.groupby(['firmNo', 'npersonid1'])   # Group exercise data by firm and employee

for name, group in grouped:
    # Expand dataset for that person
    data = group.groupby('unique_num').apply(mergeData)

    # Now process that employee's data....

Richard Stanton

Jun 16, 2013, 5:30:28 PM
to pyd...@googlegroups.com
I created some Cython code to merge some test data, putting the results into a preallocated dataframe. It's about 50x as fast as doing it using pandas merge. A worthwhile experiment!


Jeff Reback

Jun 16, 2013, 6:00:17 PM
to pyd...@googlegroups.com

great!

as you can see, sometimes solving a specific problem can be sped up a lot

Jeff

Jun 17, 2013, 3:18:22 PM
to pyd...@googlegroups.com, jeffr...@gmail.com

Richard,
you have inspired some folks to make some docs on this!

this is preliminary

Richard Stanton

Jun 17, 2013, 5:50:52 PM
to pyd...@googlegroups.com
Great idea! I'd be happy to contribute some examples based on what I've been doing recently (with your help). Combining pandas with Cython can be very valuable to those of us who like the convenience of pandas but face problems CPU-heavy enough that they just take too long to run when (first) coded in pure Python/pandas.


Andy Hayden

Jun 19, 2013, 7:30:09 PM
to pyd...@googlegroups.com
Any feedback/ideas/additions, greatly appreciated:

https://github.com/pydata/pandas/pull/3965

