Why is this groupby code so slow?


Luiz Vitor Martinez Cardoso

Oct 2, 2013, 3:15:11 PM
to pyd...@googlegroups.com
Dear friends,

I did extensive research and tried everything I could before writing this message. So, please help me if you can ;)

Let's say we have a data source providing tons of records, like:
 
dst_ip;payload;payload_response;payload_type;src_ip;timestamp
28;#live?;#live;9437184;-1;1377186527.0
25;#live?;#live;9437184;-1;1377186527.0
24;#live?;#live;9437184;-1;1377186527.0
.
.
.

Now we have to group this data by taking the last two integer digits of the timestamp and applying a simple if-chain:

25 if h <= 25 else 50 if h > 25 and h <= 50 else 75 if h > 50 and h <= 75 else 99

In other words, it maps the last two integer digits of the timestamp (00 to 99) to only four groups (25, 50, 75 and 99).

I translated this basic idea into the following code:

def gby_timestamp_window(i):
    t = self.heart_beat.loc[i, 'timestamp']
    h = t % 100
    l = 25 if h <= 25 else 50 if h > 25 and h <= 50 else 75 if h > 50 and h <= 75 else 99
    return t + l - h

availability_window_list = self.heart_beat.groupby(
    by=gby_timestamp_window,
    sort=True
).groups

The problem is that t = self.heart_beat.loc[i, 'timestamp'] is very, very slow, accounting for ~85% of the total processing time.

I already tried replacing it with:

t = self.heart_beat['timestamp'][i]

How can I achieve the same results and improve the processing time?
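For reference, the whole per-row lookup can be avoided by computing the group key as one vectorized expression. This is a sketch, assuming heart_beat is a plain DataFrame with a numeric 'timestamp' column (the tiny frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for self.heart_beat: a frame with a numeric timestamp column.
heart_beat = pd.DataFrame({'timestamp': [1377186527.0, 1377186555.0,
                                         1377186583.0, 1377186599.0]})

h = heart_beat['timestamp'] % 100            # last two integer digits (00 to 99)
l = np.select([h <= 25, h <= 50, h <= 75],   # the same if-chain, vectorized
              [25, 50, 75], default=99)
key = heart_beat['timestamp'] + l - h        # same value gby_timestamp_window returns
groups = heart_beat.groupby(key).groups      # one array op per column, no per-row .loc
```

np.select evaluates each condition over the whole column at once, so the Python-level function is never called per row.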

Best regards,
Luiz Vitor.

Jeff

Oct 2, 2013, 3:40:42 PM
to pyd...@googlegroups.com
You don't want to do a non-vectorized loop with indexing like that.

Try this

In [12]: df_orig
Out[12]: 
   dst_ip payload payload_response  payload_type  src_ip   timestamp
0      28  #live?            #live       9437184      -1  1377186527
1      25  #live?            #live       9437184      -1  1377186527
2      24  #live?            #live       9437184      -1  1377186527

Make a bigger frame

In [15]: df_orig2 = concat([ df_orig ] * 1000).reset_index()

Make the timestamps interesting

In [23]: df_orig2['timestamp'] = df_orig2['timestamp']+df_orig2.index.values

In [24]: df_orig2
Out[24]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 7 columns):
index               3000  non-null values
dst_ip              3000  non-null values
payload             3000  non-null values
payload_response    3000  non-null values
payload_type        3000  non-null values
src_ip              3000  non-null values
timestamp           3000  non-null values
dtypes: int64(5), object(2)

In [25]: df_orig2.head()
Out[25]: 
   index  dst_ip payload payload_response  payload_type  src_ip   timestamp
0      0      28  #live?            #live       9437184      -1  1377186527
1      1      25  #live?            #live       9437184      -1  1377186528
2      2      24  #live?            #live       9437184      -1  1377186529
3      0      28  #live?            #live       9437184      -1  1377186530
4      1      25  #live?            #live       9437184      -1  1377186531

This is a vectorized groupby like what you are doing, but very fast

In [37]: df_orig2['grp'] = (df_orig2['timestamp'] % 100)/25

In [42]: df_orig2.groupby('grp')['payload_type'].mean()
Out[42]: 
grp
0      9437184
1      9437184
2      9437184
3      9437184
Name: payload_type, dtype: int64
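One portability note on the bucketing line above: in this 2013 session (Python 2) the int64 division (timestamp % 100)/25 floors automatically, but on Python 3 the / operator is true division and yields floats. A sketch of the same idea with explicit floor division, on a made-up frame:

```python
import pandas as pd

# Made-up timestamps whose last two digits land in different buckets.
df = pd.DataFrame({'timestamp': [1377186527, 1377186555, 1377186583],
                   'payload_type': [9437184, 9437184, 9437184]})

# "//" floors on both Python 2 and 3, so the group labels stay integers 0..3.
df['grp'] = (df['timestamp'] % 100) // 25
means = df.groupby('grp')['payload_type'].mean()
```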

However, you might want to convert to a datetime based column

In [27]: df = df_orig2.copy()

Convert from seconds since epoch

In [28]: df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')

In [29]: df
Out[29]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 7 columns):
index               3000  non-null values
dst_ip              3000  non-null values
payload             3000  non-null values
payload_response    3000  non-null values
payload_type        3000  non-null values
src_ip              3000  non-null values
timestamp           3000  non-null values
dtypes: datetime64[ns](1), int64(4), object(2)

Resample at 15s intervals (not the same as above, but you get the idea)

In [48]: df.set_index('timestamp').resample('15s',how=np.mean).head()
Out[48]: 
                        index     dst_ip  payload_type  src_ip
timestamp                                                     
2013-08-22 15:48:45  0.923077  25.846154       9437184      -1
2013-08-22 15:49:00  1.000000  25.666667       9437184      -1
2013-08-22 15:49:15  1.000000  25.666667       9437184      -1
2013-08-22 15:49:30  1.000000  25.666667       9437184      -1
2013-08-22 15:49:45  1.000000  25.666667       9437184      -1
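A note for readers on current pandas: the how= keyword shown above was later removed, and the same resample is now spelled as a method call. A minimal sketch on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime([1377186527, 1377186530,
                                                1377186545], unit='s'),
                   'dst_ip': [28, 25, 24]})

# Modern spelling of resample('15s', how=np.mean): chain .mean() after .resample().
out = df.set_index('timestamp').resample('15s').mean()
```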

Luiz Vitor Martinez Cardoso

Oct 4, 2013, 4:36:06 PM
to pyd...@googlegroups.com
Jeff,

Great answer that solved my problem!

Thank you!