Why is this groupby code so slow?


Luiz Vitor Martinez Cardoso

Oct 2, 2013, 3:15:11 PM
to pyd...@googlegroups.com
Dear friends,

I did extensive research and tried everything I could before writing this message. So, please help me if you can ;)

Let's say we have a data source providing tons of records, like:
 
dst_ip;payload;payload_response;payload_type;src_ip;timestamp
28;#live?;#live;9437184;-1;1377186527.0
25;#live?;#live;9437184;-1;1377186527.0
24;#live?;#live;9437184;-1;1377186527.0
.
.
.

Now we have to group this data by taking the last two integer digits of the timestamp and applying a simple if-chain:

25 if h <= 25 else 50 if h > 25 and h <= 50 else 75 if h > 50 and h <= 75 else 99

In other words, it maps the last two integer digits of the timestamp (00 to 99) to only four groups (25, 50, 75 and 99).

I translated this basic idea into the following code:

def gby_timestamp_window(i):
    t = self.heart_beat.loc[i, 'timestamp']
    h = t % 100
    l = 25 if h <= 25 else 50 if h > 25 and h <= 50 else 75 if h > 50 and h <= 75 else 99
    return t + l - h

availability_window_list = self.heart_beat.groupby(
    by=gby_timestamp_window,
    sort=True
).groups

The problem is that t = self.heart_beat.loc[i, 'timestamp'] is very, very slow, accounting for ~85% of the total processing time.

I already tried replacing it with:

t = self.heart_beat['timestamp'][i]

How can I achieve the same results and improve the processing time?
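For reference, the whole per-row lookup can be avoided by computing the group key as one vectorized expression. This is a sketch, assuming heart_beat is a plain DataFrame with a numeric 'timestamp' column (the tiny frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for self.heart_beat: a frame with a numeric timestamp column.
heart_beat = pd.DataFrame({'timestamp': [1377186527.0, 1377186555.0,
                                         1377186583.0, 1377186599.0]})

h = heart_beat['timestamp'] % 100            # last two integer digits (00 to 99)
l = np.select([h <= 25, h <= 50, h <= 75],   # the same if-chain, vectorized
              [25, 50, 75], default=99)
key = heart_beat['timestamp'] + l - h        # same value gby_timestamp_window returns
groups = heart_beat.groupby(key).groups      # one array op per column, no per-row .loc
```

np.select evaluates each condition over the whole column at once, so the Python-level function is never called per row.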

Best regards,
Luiz Vitor.

Jeff

Oct 2, 2013, 3:40:42 PM
to pyd...@googlegroups.com
You don't want to do a non-vectorized loop with indexing like that.

Try this

In [12]: df_orig
Out[12]: 
   dst_ip payload payload_response  payload_type  src_ip   timestamp
0      28  #live?            #live       9437184      -1  1377186527
1      25  #live?            #live       9437184      -1  1377186527
2      24  #live?            #live       9437184      -1  1377186527

Make a bigger frame

In [15]: df_orig2 = concat([ df_orig ] * 1000).reset_index()

Make the timestamps interesting

In [23]: df_orig2['timestamp'] = df_orig2['timestamp']+df_orig2.index.values

In [24]: df_orig2
Out[24]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 7 columns):
index               3000  non-null values
dst_ip              3000  non-null values
payload             3000  non-null values
payload_response    3000  non-null values
payload_type        3000  non-null values
src_ip              3000  non-null values
timestamp           3000  non-null values
dtypes: int64(5), object(2)

In [25]: df_orig2.head()
Out[25]: 
   index  dst_ip payload payload_response  payload_type  src_ip   timestamp
0      0      28  #live?            #live       9437184      -1  1377186527
1      1      25  #live?            #live       9437184      -1  1377186528
2      2      24  #live?            #live       9437184      -1  1377186529
3      0      28  #live?            #live       9437184      -1  1377186530
4      1      25  #live?            #live       9437184      -1  1377186531

This is a vectorized groupby like what you are doing, but very fast

In [37]: df_orig2['grp'] = (df_orig2['timestamp'] % 100)/25

In [42]: df_orig2.groupby('grp')['payload_type'].mean()
Out[42]: 
grp
0      9437184
1      9437184
2      9437184
3      9437184
Name: payload_type, dtype: int64
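One portability note on the bucketing line above: in this 2013 session (Python 2) the int64 division (timestamp % 100)/25 floors automatically, but on Python 3 the / operator is true division and yields floats. A sketch of the same idea with explicit floor division, on a made-up frame:

```python
import pandas as pd

# Made-up timestamps whose last two digits land in different buckets.
df = pd.DataFrame({'timestamp': [1377186527, 1377186555, 1377186583],
                   'payload_type': [9437184, 9437184, 9437184]})

# "//" floors on both Python 2 and 3, so the group labels stay integers 0..3.
df['grp'] = (df['timestamp'] % 100) // 25
means = df.groupby('grp')['payload_type'].mean()
```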

However, you might want to convert to a datetime based column

In [27]: df = df_orig2.copy()

Convert from seconds since epoch

In [28]: df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')

In [29]: df
Out[29]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 7 columns):
index               3000  non-null values
dst_ip              3000  non-null values
payload             3000  non-null values
payload_response    3000  non-null values
payload_type        3000  non-null values
src_ip              3000  non-null values
timestamp           3000  non-null values
dtypes: datetime64[ns](1), int64(4), object(2)

Resample at 15s intervals (not the same as above, but you get the idea)

In [48]: df.set_index('timestamp').resample('15s',how=np.mean).head()
Out[48]: 
                        index     dst_ip  payload_type  src_ip
timestamp                                                     
2013-08-22 15:48:45  0.923077  25.846154       9437184      -1
2013-08-22 15:49:00  1.000000  25.666667       9437184      -1
2013-08-22 15:49:15  1.000000  25.666667       9437184      -1
2013-08-22 15:49:30  1.000000  25.666667       9437184      -1
2013-08-22 15:49:45  1.000000  25.666667       9437184      -1
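A note for readers on current pandas: the how= keyword shown above was later removed, and the same resample is now spelled as a method call. A minimal sketch on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime([1377186527, 1377186530,
                                                1377186545], unit='s'),
                   'dst_ip': [28, 25, 24]})

# Modern spelling of resample('15s', how=np.mean): chain .mean() after .resample().
out = df.set_index('timestamp').resample('15s').mean()
```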

Luiz Vitor Martinez Cardoso

Oct 4, 2013, 4:36:06 PM
to pyd...@googlegroups.com
Jeff,

Great answer that solved my problem!

Thank you!