Grouped correlation for more than two variables

Roman B

Dec 25, 2017, 1:16:36 PM
to manipulatr
Hi, 

My name is Roman and I'm new to this group. I'm working with some IMDB data and was wondering how to get correlations for multiple variables by group using tidyverse functions. For example, I can get correlations for two variables like below, but I don't know how to do it for more than two or even all the variables in the dataset. 

I'd like to be able to see correlations for any number of selected variables by group. For example, I might want the correlations among gross, imdb_score, and budget, grouped by content_rating. Here is what I have for two variables:

movies %>% group_by(content_rating) %>% 
  summarise(cor = cor(gross,imdb_score))

 content_rating        cor
            <chr>      <dbl>
 1       Approved  0.1279158
 2              G  0.4070052
 3             GP         NA
 4              M  1.0000000
 5          NC-17 -0.7905857
 6      Not Rated  0.2543184
 7         Passed  0.9909737
 8             PG  0.3130682
 9          PG-13  0.3216226
10              R  0.2146594
11        Unrated  0.1831069
12              X  0.1597880

Attaching the data here. Thanks all for your time and help. 

movies_clean.csv

eipi10

Jan 9, 2018, 5:50:29 PM
to manipulatr
Hi Roman,

To get grouped correlations, you could split the data frame by the grouping variable and then feed each group to the map function. For example, using the built-in iris data frame:

library(tidyverse)

correlations = split(iris, iris$Species) %>%
  map(~cor(.x %>% select(-Species)))

This will return a list where each element is the correlation matrix for all the variables in the data frame, grouped by Species. In your case, the code would be:

split(movies, movies$content_rating) %>%
  map(~cor(.x %>% select_if(is.numeric)))
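
For a chosen subset of variables rather than every numeric column, the same split/map idea can be sketched as below. This is only an illustration under assumptions: the built-in mtcars data stands in for the movies data, and the names vars and cors_by_cyl are made up for the example. The use = "pairwise.complete.obs" argument is an addition that does not appear in the posts above; it makes cor() drop missing values pair by pair instead of returning NA, which may or may not be what you want for your data.

```r
library(tidyverse)

# Hypothetical selection; for the movies data this might be
# c("gross", "imdb_score", "budget")
vars <- c("mpg", "hp", "wt")

# One correlation matrix per group (here: number of cylinders)
cors_by_cyl <- split(mtcars, mtcars$cyl) %>%
  map(~ cor(.x[, vars], use = "pairwise.complete.obs"))

# Look at the matrix for one group
cors_by_cyl[["4"]]
```

Each list element is a 3 x 3 matrix named after its group, so a single group can be pulled out by name as in the last line.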

eipi10

Jan 9, 2018, 5:56:11 PM
to manipulatr
One more thing: If you want all the results in a single data frame, you could do the following:

split(movies, movies$content_rating) %>%
  map_df(~cor(.x %>% select_if(is.numeric)) %>%
           as.data.frame() %>%
           rownames_to_column(var="row") %>%
           gather(column_spec, correlation, -row),
         .id="content_rating")



eipi10

Jan 9, 2018, 5:58:02 PM
to manipulatr
In my previous post "column_spec" should be just "column" (without the quotes). That was an autocomplete glitch.
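
With that correction applied, the single-data-frame version can be tried on the built-in iris data. This is a sketch rather than a tested run on the movies file: iris and the "Species" id column are stand-ins for movies and content_rating, and long_cors is just an illustrative name.

```r
library(tidyverse)

# One row per (group, row variable, column variable) combination
long_cors <- split(iris, iris$Species) %>%
  map_df(~ cor(.x %>% select_if(is.numeric)) %>%
           as.data.frame() %>%
           rownames_to_column(var = "row") %>%
           gather(column, correlation, -row),
         .id = "Species")

head(long_cors)
```

The result has four columns (Species, row, column, correlation), so a particular pair of variables can be picked out afterwards with filter().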

