I am using the groupby + apply pattern below to return a new DataFrame. However, it takes about the same amount of time in Modin as in pandas. I noticed that Modin prints "`DataFrame.<lambda>` defaulting to pandas implementation". Is there anything that could help speed this up? Thanks.
def calc_block_agg(x_df):
    # each row contains multiple blocks
    # identify first (smallest across all first blocks) and last block (largest across all last blocks)
    # calculate the span of bytes between them
    # explicit dtype avoids the empty-Series default-dtype ambiguity across pandas versions
    res = pd.Series(index='firstBlockBegin lastBlockEnd'.split(), dtype='float64')
    min_row = x_df.loc[x_df['firstBlockOffset'].idxmin()]
    max_row = x_df.loc[x_df['lastBlockOffset'].idxmax()]
    res['firstBlockBegin'] = min_row['firstBlockOffset']
    res['lastBlockEnd'] = max_row['lastBlockOffset'] + max_row['lastBlockSize']
    res['lastBlockEnd - firstBlockBegin (KB)'] = (res['lastBlockEnd'] - res['firstBlockBegin'])/1024.0
    return res
# single id may involve multiple rows across many files
agg_df = df.groupby(['id', 'file']).apply(calc_block_agg)
agg_df.head()
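
For comparison, here is a minimal sketch of the same aggregation written without `apply`, using only built-in groupby reductions (`min`, `idxmax`). The function name `calc_block_agg_vectorized` is hypothetical, and the sketch assumes that `df` has a unique index and no missing values in the offset columns. It is written against plain pandas; with Modin the import would be `import modin.pandas as pd`. I have not measured whether this avoids the "defaulting to pandas" path in Modin, so treat it as something to try rather than a confirmed fix.

```python
import pandas as pd


def calc_block_agg_vectorized(df):
    # Hypothetical helper: same result columns as calc_block_agg, but no per-group Python call.
    grouped = df.groupby(['id', 'file'])

    # Smallest first-block offset per (id, file) group.
    first_begin = grouped['firstBlockOffset'].min()

    # Row label of the largest last-block offset per group (assumes a unique index on df).
    max_idx = grouped['lastBlockOffset'].idxmax()
    max_rows = df.loc[max_idx.to_numpy()]
    last_end = (max_rows['lastBlockOffset'] + max_rows['lastBlockSize']).to_numpy()

    # Both reductions return groups in the same sorted key order, so positions line up.
    out = pd.DataFrame(
        {'firstBlockBegin': first_begin.to_numpy(), 'lastBlockEnd': last_end},
        index=first_begin.index,
    )
    out['lastBlockEnd - firstBlockBegin (KB)'] = (
        out['lastBlockEnd'] - out['firstBlockBegin']
    ) / 1024.0
    return out
```

It would be called as `agg_df = calc_block_agg_vectorized(df)`, which returns one row per `(id, file)` pair, like the `apply` version.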