Speed Test


Nick Koenig

Jan 13, 2020, 5:15:04 PM
to modi...@googlegroups.com
I was doing a quick speed test comparing three versions of the same function: a .apply() call with a named function, a .apply(lambda x: ...) version, and an np.where() vectorization.

In pandas, np.where() is ~10x faster on a DataFrame of shape (725000, 91).

However, in modin.pandas it is much slower. Just curious if you all have seen method differences like this and have thoughts. BTW I love the work you all are doing. 
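For readers without the screenshot, here is a minimal sketch of the three variants being compared. The actual function from the post is not shown in the thread, so the column names, the data, and the `label_row` rule below are all hypothetical stand-ins, and the frame is far smaller than the (725000, 91) one in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical data and rule: the function from the original post is not in the thread.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=1_000),
    "b": rng.integers(0, 100, size=1_000),
})


def label_row(row):
    """Row-wise rule: 'high' when column a exceeds column b, else 'low'."""
    return "high" if row["a"] > row["b"] else "low"


# 1. .apply() with a named function, evaluated one row at a time
via_apply = df.apply(label_row, axis=1)

# 2. .apply() with an equivalent lambda
via_lambda = df.apply(lambda row: "high" if row["a"] > row["b"] else "low", axis=1)

# 3. np.where() evaluated over whole columns at once
via_where = pd.Series(np.where(df["a"] > df["b"], "high", "low"), index=df.index)

# All three variants should produce the same labels.
assert via_apply.tolist() == via_lambda.tolist() == via_where.tolist()
```

Swapping `import pandas as pd` for `import modin.pandas as pd` and wrapping each variant in a timer would give the kind of comparison described above.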
[Attachment: image.png]

Devin Petersohn

Jan 13, 2020, 9:17:23 PM
to Nick Koenig, modin-dev
Hi Nick,

Thanks for reaching out, great question. When it comes to `np.where` and other numpy functionalities, there's a lot of room for improvement in Modin. With the recent updates to numpy, we can implement native versions of these and have them run just as fast. The performance degradation you're seeing is the result of us converting the object to a numpy array, then back to a distributed Series. This takes a long time because we're effectively collecting all of the data, merging it, doing the operation (np.where), then resplitting the data. Each of these has a high overhead, which causes the runtime to explode. Thanks for posting this question, feel free to reach out with any others!

Devin
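The round trip described in this reply can be illustrated with plain pandas. This is only a sketch of the idea, not Modin's implementation: the "partitions" are ordinary pandas Series slices standing in for distributed blocks, and the data and chunk size are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data, split into pieces that stand in for distributed partitions.
s = pd.Series(np.arange(-5, 5))
chunk = 4
partitions = [s.iloc[i:i + chunk] for i in range(0, len(s), chunk)]

# Round trip: gather every partition into one NumPy array, run np.where once,
# then split the result back into partitions.
gathered = pd.concat(partitions).to_numpy()
result_array = np.where(gathered < 0, 0, gathered)
round_trip = [
    pd.Series(result_array[i:i + chunk], index=s.index[i:i + chunk])
    for i in range(0, len(s), chunk)
]

# Partition-local: apply the same rule to each piece where it already lives,
# with no gather or resplit step.
local = [p.mask(p < 0, 0) for p in partitions]

# Both routes give the same values; only the data movement differs.
assert pd.concat(round_trip).tolist() == pd.concat(local).tolist()
```

In a distributed setting, the gather and resplit steps are where the data movement cost sits, which is the overhead the reply attributes the slowdown to.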


Nick Koenig

Jan 14, 2020, 9:17:45 AM
to Devin Petersohn, modin-dev
Hi Devin, 

I appreciate the detailed response. That's great to hear. I'm not a Python expert, more self-taught, but if there is any way I could help out, I'd be more than happy to. I love the idea of scaling dataframe offerings within Python to speed up performance.

Thanks again, 

Nick

Devin Petersohn

Jan 14, 2020, 1:18:10 PM
to Nick Koenig, modin-dev
If you're interested in getting involved, that's great! People of all different degrees of experience with Python have contributed to Modin, so there's no issue with that. You are likely to learn something along the way too.

If you want to take a look at this issue, it would be great for a first dive into the code: https://github.com/modin-project/modin/issues/946

It involves changing the default number of columns to improve performance. Let me know what you think or if you have any questions. Questions about the implementation are better on the issue itself so we can keep the discussion in one place.

Devin