I recently asked a question on SO about the tidyverse method for splitting a df by multiple columns.
In the example I provided, I wanted to split a df by two cols and obtain a summary() output for each subset of the df.
My initial instinct was to use purrr::by_slice(), but that has been deprecated.
The suggested solution uses group_by() followed by do(), resulting in a list-col with the summaries:
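library(tidyverse)
library(magrittr)
mtcars_summary <- mtcars %>%
  select(1:3) %>%
  # assign each row to random groups (no seed is set, so groups vary per run)
  mutate(
    GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
    GRP_B = sample(1:2, n(), replace = TRUE)
  ) %>%
  group_by(GRP_A, GRP_B) %>%
  # store one summary() table per group in a list-column
  do(SUMMARY = summary(.))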
Here's the structure of the output:
mtcars_summary
#> Source: local data frame [4 x 3]
#> Groups: <by row>
#>
#> # A tibble: 4 × 3
#>   GRP_A GRP_B     SUMMARY
#> * <chr> <int>      <list>
#> 1     A     1 <S3: table>
#> 2     A     2 <S3: table>
#> 3     B     1 <S3: table>
#> 4     B     2 <S3: table>
... and the summaries themselves:
mtcars_summary[["SUMMARY"]]
#> [[1]]
#>       mpg             cyl         disp          GRP_A
#>  Min.   :14.30   Min.   :4   Min.   :120.3   Length:7
#>  1st Qu.:18.00   1st Qu.:4   1st Qu.:143.8   Class :character
#>  Median :21.00   Median :6   Median :160.0   Mode  :character
#>  Mean   :20.64   Mean   :6   Mean   :223.4
#>  3rd Qu.:23.60   3rd Qu.:8   3rd Qu.:317.9
#>  Max.   :26.00   Max.   :8   Max.   :360.0
#>      GRP_B
#>  Min.   :1
#>  1st Qu.:1
#>  Median :1
#>  Mean   :1
#>  3rd Qu.:1
#>  Max.   :1
#>
#> [[2]]
#>       mpg             cyl             disp          GRP_A
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Length:11
#>  1st Qu.:14.95   1st Qu.:5.000   1st Qu.:143.8   Class :character
#>  Median :16.40   Median :8.000   Median :275.8   Mode  :character
#>  Mean   :18.94   Mean   :6.545   Mean   :247.4
#>  3rd Qu.:20.35   3rd Qu.:8.000   3rd Qu.:334.0
#>  Max.   :33.90   Max.   :8.000   Max.   :460.0
#>      GRP_B
#>  Min.   :2
#>  1st Qu.:2
#>  Median :2
#>  Mean   :2
#>  3rd Qu.:2
#>  Max.   :2
#>
#> [[3]]
#>       mpg             cyl             disp           GRP_A
#>  Min.   :15.00   Min.   :4.000   Min.   : 78.70   Length:6
#>  1st Qu.:19.32   1st Qu.:4.000   1st Qu.: 86.25   Class :character
#>  Median :21.25   Median :5.000   Median :126.50   Mode  :character
#>  Mean   :22.73   Mean   :5.667   Mean   :185.28
#>  3rd Qu.:26.18   3rd Qu.:7.500   3rd Qu.:262.00
#>  Max.   :32.40   Max.   :8.000   Max.   :400.00
#>      GRP_B
#>  Min.   :1
#>  1st Qu.:1
#>  Median :1
#>  Mean   :1
#>  3rd Qu.:1
#>  Max.   :1
#>
#> [[4]]
#>       mpg             cyl            disp          GRP_A
#>  Min.   :10.40   Min.   :4.00   Min.   : 95.1   Length:8
#>  1st Qu.:15.65   1st Qu.:5.50   1st Qu.:150.2   Class :character
#>  Median :19.55   Median :6.00   Median :241.5   Mode  :character
#>  Mean   :19.21   Mean   :6.25   Mean   :248.3
#>  3rd Qu.:21.40   3rd Qu.:8.00   3rd Qu.:315.8
#>  Max.   :30.40   Max.   :8.00   Max.   :472.0
#>      GRP_B
#>  Min.   :2
#>  1st Qu.:2
#>  Median :2
#>  Mean   :2
#>  3rd Qu.:2
#>  Max.   :2
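For comparison, here is a minimal sketch of what a do()-free version might look like, assuming tidyr::nest() and purrr::map() are acceptable substitutes. Neither is part of the suggested solution; the set.seed() call and the name mtcars_nested are my own additions for illustration, and I am not claiming this is the recommended approach:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # assumption: added only so the random groups are reproducible

mtcars_nested <- mtcars %>%
  select(1:3) %>%
  mutate(
    GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
    GRP_B = sample(1:2, n(), replace = TRUE)
  ) %>%
  group_by(GRP_A, GRP_B) %>%
  nest() %>%                             # one row per group; subset stored in `data`
  ungroup() %>%
  mutate(SUMMARY = map(data, summary))   # apply summary() to each nested subset
```

One difference from the do() version: nest() leaves GRP_A and GRP_B out of each nested subset, so these summaries cover only mpg, cyl and disp.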
Main Question: Is the use of do() in this example the recommended way of working with a grouped/split dataframe?
Secondary Question: Is do() going to be deprecated, as was suggested here?