`DataFrame.<lambda>` defaulting to pandas implementation


cheng...@gmail.com

Aug 30, 2019, 7:28:31 PM
to modin-dev
I am using the groupby + apply pattern below to return a new dataframe. However, it takes about the same amount of time in Modin as in pandas. I noticed that Modin prints "`DataFrame.<lambda>` defaulting to pandas implementation". Is there anything that could help accelerate this? Thanks.

import modin.pandas as pd  # assumed import; the post does not show it

def calc_block_agg(x_df):
    # Each row contains multiple blocks. Identify the first block (smallest
    # offset across all first blocks) and the last block (largest offset across
    # all last blocks), then calculate the span of bytes between them.
    # An explicit dtype avoids an object-dtype result for the empty Series.
    res = pd.Series(index=['firstBlockBegin', 'lastBlockEnd'], dtype=float)
    min_row = x_df.loc[x_df['firstBlockOffset'].idxmin()]
    max_row = x_df.loc[x_df['lastBlockOffset'].idxmax()]

    res['firstBlockBegin'] = min_row['firstBlockOffset']
    res['lastBlockEnd'] = max_row['lastBlockOffset'] + max_row['lastBlockSize']
    res['lastBlockEnd - firstBlockBegin (KB)'] = (res['lastBlockEnd'] - res['firstBlockBegin']) / 1024.0
    return res

# single id may involve multiple rows across many files
agg_df = df.groupby(['id', 'file']).apply(calc_block_agg)
agg_df.head()

Devin Petersohn

Aug 31, 2019, 1:47:24 AM
to cheng...@gmail.com, modin-dev
Thanks for the question!

The reason you see no speedup here is that we don't yet have a parallel implementation of groupby for the case where multiple columns are passed in, so the operation defaults to pandas. This is in our plans for the upcoming quarter. If the groupby is given a single column, we do have a parallel implementation, and it should generally be relatively fast (allowing for multiprocessing overheads).
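
To make the single-column case concrete, here is a minimal sketch of one possible workaround: fold the two keys into a single key column and replace the `apply` with built-in aggregations. It is written against plain pandas so it runs anywhere; with Modin the import would be `import modin.pandas as pd`. The sample data, the `id_file` column name, and the `|` separator are all made up for illustration, and whether this actually avoids the pandas fallback in a given Modin version has not been verified here.

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Tiny made-up frame using the column names from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "file": ["a", "a", "b", "a"],
    "firstBlockOffset": [4096, 1024, 2048, 0],
    "lastBlockOffset": [8192, 6144, 4096, 1024],
    "lastBlockSize": [512, 1024, 2048, 1024],
})

# Hypothetical single key built from both grouping columns.
# Assumes "|" never appears in either column's values.
df["id_file"] = df["id"].astype(str) + "|" + df["file"].astype(str)

grouped = df.groupby("id_file")

# Smallest first-block offset per group.
first_begin = grouped["firstBlockOffset"].min()

# Row holding the largest last-block offset per group
# (relies on the frame having a unique row index).
last_rows = df.loc[grouped["lastBlockOffset"].idxmax()].set_index("id_file")
last_end = last_rows["lastBlockOffset"] + last_rows["lastBlockSize"]

agg_df = pd.DataFrame({
    "firstBlockBegin": first_begin,
    "lastBlockEnd": last_end,
})
agg_df["lastBlockEnd - firstBlockBegin (KB)"] = (
    agg_df["lastBlockEnd"] - agg_df["firstBlockBegin"]
) / 1024.0
```

The result is indexed by the combined key rather than by an (`id`, `file`) MultiIndex, so the two original columns would have to be carried along or split back out if they are needed downstream.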

Let me know if this answers your question!

Devin

--
You received this message because you are subscribed to the Google Groups "modin-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modin-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/modin-dev/80c1e45a-38bf-4b45-9738-32cd35e2825e%40googlegroups.com.

cheng...@gmail.com

Sep 1, 2019, 8:40:51 PM
to modin-dev
Sounds great. Looking forward to the multi-column groupby ...

Btw, is there a list that summarizes the up-to-date limitations?

Devin Petersohn

Sep 1, 2019, 8:45:46 PM
to cheng...@gmail.com, modin-dev
The documentation is up-to-date for supported methods (and limitations of each implementation): https://modin.readthedocs.io/en/latest/UsingPandasonRay/dataframe_supported.html

The documentation quality can be drastically improved, though. There are a lot of anti-patterns we should compile into a list (e.g. don't loop over the data, because that is slow no matter what). Let me know if you have any suggestions about what would help in terms of documentation.
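
As a small illustration of the looping anti-pattern, the sketch below computes the same per-row value twice: once by iterating over rows in Python, and once as a single expression over whole columns. It uses plain pandas and made-up data with column names borrowed from the question; it makes no timing claims and only shows the two shapes of code.

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "lastBlockOffset": [8192, 6144, 4096],
    "lastBlockSize": [512, 1024, 2048],
})

# Anti-pattern: visit rows one at a time in a Python loop.
ends_loop = []
for _, row in df.iterrows():
    ends_loop.append(row["lastBlockOffset"] + row["lastBlockSize"])

# Column-wise form: one expression applied to entire columns at once.
ends_vectorized = df["lastBlockOffset"] + df["lastBlockSize"]
```

Both forms produce the same values; the column-wise one leaves the iteration to the library instead of the Python interpreter.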

Devin


huangfa...@gmail.com

Sep 2, 2019, 1:39:51 AM
to modin-dev
Thanks. Great to see that "P" entries are so few! For such entries, I'd love to see 1 or 2 examples each ...