dplyr::mutate seems unexpectedly slow

1,329 views
Skip to first unread message

Ross Gayler

unread,
Jun 20, 2016, 1:42:57 AM6/20/16
to manipulatr
Hi,

I have been using dplyr to data wrangle a modestly large tbl_df (~2.4 million rows by ~80 columns) on my not particularly fast laptop. Most operations are pleasantly fast, but some are surprisingly (to me) slow).

In the code below, the arrange() %>% group_by() takes ~20 seconds, whereas the relatively trivial mutate() takes ~20 minutes.

There are ~600k groups, each of ~4 rows .

last_file_date <- max(input_files$as_at_date)

# This is quite slow (~20 minutes). The mutate() is unexpectedly slow.
d
%<>%
  ungroup
() %>%
  arrange_
(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
  group_by_
(~ id) %>% # check each entity
  mutate_
(
    first_date
= ~ first(as_at_date), # first observed as-at date for this entity
    last_date
= ~ last(as_at_date), # last observed as-at date for this entity
    last_date_ok
= ~ last_date == last_file_date,
    n_dates
= ~ n(),
    is_consecutive
= ~ (as_at_date == first_date) |
     
(as_at_date == (lag(as_at_date) + months(6))),
    all_consecutive
= ~ all(is_consecutive)
 
)

Does anyone have any suggestions as to why the mutate() is comparatively slow?

The only thought that has occurred to me is that there is a significant overhead per group overhead in mutate and here I have very many tiny groups. If that is the case, there is probably nothing I can do to speed it up without constructing artificially larger groups and simulating the effect of smaller groups in the mutate logic, which would remove the clarity of the code.

Thanks

Ross

Hadley Wickham

unread,
Jun 20, 2016, 10:04:59 AM6/20/16
to Ross Gayler, manipulatr
Can you try creating each new column in it's own mutate? That would
help narrow down if there's a function that's particularly slow, or
it's just overhead related to the many groups.

Hadley
> --
> You received this message because you are subscribed to the Google Groups
> "manipulatr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to manipulatr+...@googlegroups.com.
> To post to this group, send email to manip...@googlegroups.com.
> Visit this group at https://groups.google.com/group/manipulatr.
> For more options, visit https://groups.google.com/d/optout.



--
http://hadley.nz

Ross Gayler

unread,
Jun 21, 2016, 5:14:20 AM6/21/16
to manipulatr, r.ga...@gmail.com
Thanks, Hadley. My brain was in neutral yesterday. The problem is clearly in base::months() being appallingly slow.

I took a ~10% sample of the data to speed up testing.

> last_file_date <- max(input_files$as_at_date)
>
> dim(d)
[1] 170165     74
>
> system.time({
+   d %>%
+     ungroup() %>%
+     arrange_(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
+     group_by_(~ id) -> d2 # check each entity
+   })
   user  system elapsed
 
0.817   0.000   0.818
>
> system.time({
+   d %>%
+     ungroup() %>%
+     arrange_(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
+     group_by_(~ id) %>% # check each entity
+     mutate_(
+       first_date = ~ first(as_at_date), # first observed as-at date for this entity
+       last_date = ~ last(as_at_date), # last observed as-at date for this entity
+       last_date_ok = ~ last_date == last_file_date,
+       n_dates = ~ n()
+     ) -> d2
+ })
   user  system elapsed
 
1.798   0.000   1.799
>
> system.time({
+   d %>%
+     ungroup() %>%
+     arrange_(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
+     group_by_(~ id) %>% # check each entity
+     mutate_(
+       first_date = ~ first(as_at_date), # first observed as-at date for this entity
+       last_date = ~ last(as_at_date), # last observed as-at date for this entity
+       last_date_ok = ~ last_date == last_file_date,
+       n_dates = ~ n(),
+       is_consecutive = ~ (as_at_date == first_date) |
+         (as_at_date == (lag(as_at_date) + months(6)))
+     ) -> d2
+ })
   user  system elapsed
 
81.546   0.191  81.806
>
> system.time({
+   months_6 <- months(6)
+   d %>%
+     ungroup() %>%
+     arrange_(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
+     group_by_(~ id) %>% # check each entity
+     mutate_(
+       first_date = ~ first(as_at_date), # first observed as-at date for this entity
+       last_date = ~ last(as_at_date), # last observed as-at date for this entity
+       last_date_ok = ~ last_date == last_file_date,
+       n_dates = ~ n(),
+       is_consecutive = ~ (as_at_date == first_date) |
+         (as_at_date == (lag(as_at_date) + months_6))
+     ) -> d2
+ })
   user  system elapsed
 
19.483   0.187  19.691
>
> system.time({
+   months_6 <- months(6)
+   d %>%
+     ungroup() %>%
+     arrange_(~ id, ~ as_at_date) %>% # order rows by as-at date within entity
+     group_by_(~ id) %>% # check each entity
+     mutate_(
+       first_date = ~ first(as_at_date), # first observed as-at date for this entity
+       last_date = ~ last(as_at_date), # last observed as-at date for this entity
+       last_date_ok = ~ last_date == last_file_date,
+       n_dates = ~ n(),
+       is_consecutive = ~ (as_at_date == first_date) |
+         (as_at_date == (lag(as_at_date) + 180))
+     ) -> d2
+ })
   user  system elapsed
 
3.764   0.004   3.772
>

Including the call to base::months in the mutate increased the run time by > 20 times. Using the result of the call to base::months in the mutate increased the run time by ~ 5 times.

That's pretty amazing.

Ross



Reply all
Reply to author
Forward
0 new messages