dplyr filter slow on grouped_df


Kevin Just

Apr 23, 2014, 2:30:36 PM
to manip...@googlegroups.com
Hi, I have a large 'grouped_df' - around 1 GB, 5 columns, 20M rows.

B %.% filter(C1 %in% uniques_to_keep) takes 3 minutes.  I finally noticed that:
ungroup(B) %.% filter(C1 %in% uniques_to_keep) takes 1 second.
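Roughly, the shape of the data and calls is as below (names and sizes invented for illustration, and scaled down; `%.%` is the dplyr chain operator of this version, later replaced by `%>%`):

```r
library(dplyr)

# Invented data of roughly the same shape as B (much smaller here):
n <- 2e6
B <- group_by(
  data.frame(C1  = paste0(sample(1e4, n, replace = TRUE), "_x"),
             val = runif(n),
             stringsAsFactors = FALSE),
  C1)

uniques_to_keep <- unique(B$C1)[1:2000]

system.time(B %.% filter(C1 %in% uniques_to_keep))           # slow: predicate evaluated per group
system.time(ungroup(B) %.% filter(C1 %in% uniques_to_keep))  # fast: predicate evaluated once
```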

Is this expected?  I vaguely recall reading somewhere that it would be, but I can't find the reference.

Could someone please point me to documentation if the reason is explained somewhere?

Hadley Wickham

Apr 23, 2014, 2:50:34 PM
to Kevin Just, manipulatr
Hi Kevin,

It's somewhat expected, because if you do g %.% filter(x == max(x))
then the number of times you need to compute max depends on whether
the data is grouped or not. In principle, we should be able to figure
out that C1 %in% uniques_to_keep doesn't vary across groups, but
that's relatively hard to do and we obviously don't pick up on this
case yet. Romain will have more details.
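For example, with toy data, grouping genuinely changes the answer, which is why the predicate has to be re-evaluated per group:

```r
library(dplyr)

df <- data.frame(g = c(1, 1, 2, 2), x = c(1, 5, 3, 4))

# Ungrouped: max(x) is computed once over the whole column.
df %.% filter(x == max(x))                  # keeps the single row with x == 5

# Grouped: max(x) must be recomputed within each group.
df %.% group_by(g) %.% filter(x == max(x))  # keeps x == 5 (g = 1) and x == 4 (g = 2)
```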

Hadley



--
http://had.co.nz/

Romain Francois

Apr 24, 2014, 1:01:06 AM
to Hadley Wickham, Kevin Just, manipulatr
It is definitely very hard to decide that some expression could be run on the full data rather than by chunks. I have not yet found the right way to approach the problem; it is essentially a case-by-case thing. The good news is that we will handle this particular case at some point: https://github.com/hadley/dplyr/issues/126

Now, the difference between 3 minutes and 1 second seems quite huge. Could you make some reproducible data for me to play with? Things of relevance are going to be the types of the variables in B, the number of groups in C1, and the length of uniques_to_keep.

Romain

Kevin Just

Apr 24, 2014, 1:37:40 AM
to manip...@googlegroups.com, Hadley Wickham, Kevin Just
I will make a reproducible example as soon as I can, once a very tight deadline passes.  This doesn't help much now, but uniques_to_keep has length 20000, the number of groups in C1 is about 30000, and the two grouping variables are integer and character.  The other three variables are a Date, a numeric, and the character column C1 ( <- paste0( grouping_var_1, '_', grouping_var_2)), which has the same form as the values in uniques_to_keep.

Coincidentally, I hit a bug today very similar to Hadley's toy example.  On virtually the same grouped dataset I was doing B %.% filter(min_date == min(min_date)) and getting incorrect results (two distinct min_dates), whereas ungroup(B) %.% filter(...) yields the correct result.  I know there is another topic on this error already, but I will dig into this more as soon as I can in case it's different.  For now, my lesson is to try to remember to ungroup(grouped_df), for better or worse.
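To illustrate the difference I ran into, here is a toy sketch (invented data): on a grouped table, min(min_date) is the per-group minimum, so more than one distinct date can survive the filter, while on the ungrouped table it is the global minimum.

```r
library(dplyr)

d <- group_by(
  data.frame(id = c(1, 1, 2),
             min_date = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01"))),
  id)

# Grouped: min(min_date) is computed within each id, so two distinct dates remain.
d %.% filter(min_date == min(min_date))

# Ungrouped: min(min_date) is computed over the whole column, one date remains.
ungroup(d) %.% filter(min_date == min(min_date))
```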