Large data sets- Scatterplot matrix

Laura Patterson

May 7, 2015, 2:46:27 PM
to davi...@googlegroups.com
Hi
I have a very large data set with 46 variables, and when I run the scatterplot matrix in R I can't plot all the variables at once.
Does anyone have another way to compare correlations among all the variables at once?
The variables are continuous and/or categorical.
Thanks

Daniel Fulop

May 7, 2015, 2:54:39 PM
to davi...@googlegroups.com
Hi Laura,

I haven't tested the code, but I think this should do it:

lapply(combn(ncol(x), 2, simplify=FALSE), function(i) cor(x[i][1], x[i][2]))

...where x is a data.frame with your 46 variables as columns and observations as rows.  You can then do a similar lapply with paste() as the function to generate names for the pairwise correlations.
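
A minimal sketch of that idea, including the paste() step for names. The toy data frame and its column names are made up for illustration, and `[[` is swapped in for x[i][1] so that each column is pulled out as a plain numeric vector:

```r
# Toy stand-in for the real data: three numeric columns (hypothetical names).
x <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c(2, 4, 6, 8, 10),
                c = c(5, 3, 4, 1, 2))

# Every pair of column indices, as a list of length-2 integer vectors.
pairs <- combn(ncol(x), 2, simplify = FALSE)

# Correlation for each pair; x[[k]] extracts column k as a vector.
pair_cor <- sapply(pairs, function(i) cor(x[[i[1]]], x[[i[2]]]))

# Label each value with the two column names it came from.
names(pair_cor) <- sapply(pairs, function(i) paste(names(x)[i], collapse = "_"))

pair_cor
```

This only works when every column is numeric; factor columns would need to be dropped or handled separately first.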

HTH,
Dan.

--
Daniel Fulop, Ph.D.
Postdoctoral Scholar
Dept. Plant Biology, UC Davis
Maloof Lab, Rm. 2220
Life Sciences Addition, One Shields Ave.
Davis, CA 95616

510-253-7462
dfu...@ucdavis.edu

Michael Levy

May 7, 2015, 3:09:07 PM
to davi...@googlegroups.com
Hi Laura,

You might also check out the corrplot package:




-- 
Michael Levy
c: 304-376-4523

Laura Patterson

May 7, 2015, 3:36:20 PM
to davi...@googlegroups.com
Thanks.
Do I need to replace i with anything?
And do I need x[i][1], x[i][2] for every variable, i.e. up to 46?

Daniel Fulop

May 7, 2015, 3:58:23 PM
to davi...@googlegroups.com
No, you don't need to replace i with anything.

lapply is doing it for every variable, in the combinations specified by combn().

combn(,2) will output every pairwise combination of the 46 columns/variables (each pair is passed to the function as i), and the [1] and [2] then refer to the 1st and 2nd variable in each pairwise combination.
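
To see what i looks like, here is a small illustration using 4 columns instead of 46 (the number 4 is only chosen to keep the output short):

```r
# All pairs of column indices for a 4-column data set.
idx <- combn(4, 2, simplify = FALSE)

length(idx)    # how many pairs there are
idx[[1]]       # the first pair: this is what lapply passes in as i
choose(46, 2)  # how many pairs there would be for 46 variables
```

So lapply calls the function once per pair, and i is always a vector of two column indices.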

Laura Patterson

May 7, 2015, 4:06:06 PM
to davi...@googlegroups.com
Thank you! That looks perfect.
However, I get this error message:
Error in cor(Broiler) : 'x' must be numeric
Can I only include continuous variables or numerical categorical variables?
Laura

Daniel Fulop

May 7, 2015, 4:32:54 PM
to davi...@googlegroups.com
Actually, cor() and cov() accept whole matrices or data.frames, so you could also do cor(x), where x is your whole data as a data.frame or matrix with variables as columns.  That is, there's no need for lapply, etc.
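
A rough sketch of that, assuming the data frame mixes numeric and factor columns (the toy data and column names here are invented). Dropping the non-numeric columns first avoids the "'x' must be numeric" error:

```r
# Toy data frame mixing numeric and factor columns (hypothetical names).
df <- data.frame(weight = c(25.1, 24.8, 26.0, 25.5, 24.9),
                 mort   = c(1.9, 2.1, 1.2, 1.5, 2.0),
                 type   = factor(c("organic", "conv", "organic", "conv", "conv")))

# Flag the numeric columns; cor() stops with an error if a factor is included.
num_cols <- vapply(df, is.numeric, logical(1))

# Correlation matrix of the numeric columns only, tolerating missing values.
cor_mat <- cor(df[, num_cols, drop = FALSE], use = "pairwise.complete.obs")

cor_mat
```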

I don't know what to say about categorical/factor variables. I don't know of a good correlation measure for a continuous and a categorical variable.  I think cor(method="spearman") or cor(method="kendall") will run, but what the correlation value means in such cases is hard to interpret.

I would get a cor() value for all the pairs of continuous variables, and then examine the relationship of each categorical variable to the rest of the data on its own.  For the latter you can use generalized pairs plots as implemented in the gpairs package and the ggpairs() function of the GGally package; they have many ways of plotting different kinds of data.  Another option is to use scagnostics (http://cran.r-project.org/web/packages/scagnostics/index.html) to explore the relationship between each pair of variables.

shriram gajjar

May 8, 2015, 12:45:19 AM
to davi...@googlegroups.com
I use a parallel coordinate plot to look at data correlations in a large data set. Hope it helps.
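
One possible way to draw such a plot is parcoord() from the MASS package. This is only a sketch using the built-in iris data as a stand-in, and it assumes MASS is installed (it ships with most R installations):

```r
# Numeric columns only; parcoord() expects a numeric matrix or data frame.
num <- iris[, 1:4]

# One line per observation, coloured by the categorical variable.
MASS::parcoord(num, col = as.integer(iris$Species))
```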
Regards
Shriram 

Laura Patterson

May 8, 2015, 12:51:51 AM
to davi...@googlegroups.com
I got this error message. Do you know what might be wrong? I have a mix of continuous and categorical data. Thanks.

Error in cor(Broiler[i][1], Broiler[i][2]) : 'x' must be numeric

Brandon Hurr

May 8, 2015, 1:07:04 AM
to davi...@googlegroups.com
Laura. What kind of categorical data are we talking about? Is it ordered or is the order irrelevant?
I think most correlation methods mentioned so far require numeric data. Could you provide the str() or dput() of your dataset and maybe a more thorough description of the categorical variables?
B

Laura Patterson

May 8, 2015, 1:22:12 PM
to davi...@googlegroups.com
Hi
Here is the str() output below.
The data are a mix of continuous, categorical:nominal (type of farm: organic or conventional), categorical:ordinal (day of the week), and discrete (Farm ID) variables.
It seems like I might need to change the continuous variables to categorical so they all match? Or is there a way to compare these different types of variables?
Thank you for your help.
Laura

'data.frame': 483 obs. of  57 variables:
 $ CloseWeek            : Factor w/ 31 levels "&","0","0.25",..: 15 15 15 15 15 15 15 15 25 25 ...
 $ RanchID              : Factor w/ 62 levels "","0","0.25",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ HatchDate            : Factor w/ 133 levels "","0","41432",..: 58 62 53 64 56 59 59 62 110 113 ...
 $ KillDate             : int  41570 41576 41566 41578 41569 41572 41573 41577 41635 41638 ...
 $ BreederLoc           : Factor w/ 3 levels "","Breeder","MIX": 2 2 2 2 2 2 2 2 2 2 ...
 $ Hatchery             : Factor w/ 4 levels "","1","2","Hatchery": 2 2 2 2 2 2 2 2 2 2 ...
 $ BreederAge           : Factor w/ 3 levels "","B","Hatchery": 3 3 3 3 3 3 3 3 3 3 ...
 $ BirdType             : Factor w/ 4 levels "","A","B","mix": 2 2 2 2 2 2 2 2 2 2 ...
 $ RanchType            : Factor w/ 8 levels "","contract",..: 6 6 6 6 6 6 6 6 4 6 ...
 $ FCPU                 : Factor w/ 6 levels "","abx","contract",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Livability           : Factor w/ 51 levels "","1.765474059",..: 50 50 50 50 50 50 50 50 50 50 ...
 $ FirstWkMort          : num  1.92 1.92 1.92 1.92 1.92 ...
 $ SecondWkMort         : num  0.947 0.947 0.947 0.947 0.947 ...
 $ DensityIn            : num  0.00948 0.00948 0.00948 0.00948 0.00948 ...
 $ DensityOut           : num  0.00238 0.00238 0.00238 0.00238 0.00238 ...
 $ LitterRun            : num  1 1 1 1 1 ...
 $ Downtime             : num  1.06 1.06 1.06 1.06 1.06 ...
 $ AgePU                : num  5 5 5 5 5 5 5 5 6 6 ...
 $ KillWeight           : num  25.1 25.1 25.1 25.1 25.1 ...
 $ Leukosis             : num  43 44 45 44 44 44 45 45 45 44 ...
 $ Septox               : num  5.17 5.06 5.63 5.46 5.29 5.43 5.71 5.73 5.67 5.17 ...
 $ WCond                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DressTrim            : num  0.21 0.52 0.14 0.37 0.18 0.11 0.16 0.4 0.34 0.36 ...
 $ PlantDOA             : num  0.34 0.89 0.23 0.62 0.34 0.16 0.27 0.58 0.51 0.53 ...
 $ TotalCond            : num  0.28 0.19 0.18 0.21 0.36 0.26 0.22 0.22 0.18 0.23 ...
 $ Gall                 : num  0.1 0.14 0.09 0.1 0.08 0.08 0.1 0.11 0.15 0.08 ...
 $ Fecal                : num  0.96 1.39 0.66 1.04 1.07 0.68 0.68 1.04 1.08 1.02 ...
 $ Airsac9              : num  0.82 1.07 1.06 0.78 1.04 0.53 0.53 1.14 1.26 0.49 ...
 $ Airsac10             : num  3.29 3.79 2.24 3.32 2.08 2.09 3.8 3.27 2.34 1.32 ...
 $ Airsac11             : num  0.00299 0.00931 0.01148 0.02379 0.0029 ...
 $ Airsac12             : num  0 0 0 0 0 ...
 $ Airsac13             : num  0 0 0 0 0 ...
 $ DiseasedLeg          : num  4.89 10.35 2.65 8.07 3.09 ...
 $ totalas              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DiseaseInv           : num  0.2125 0.273 0.1493 0.11 0.0813 ...
 $ TotalCellt           : num  4.89 10.36 2.66 8.09 3.09 ...
 $ ProcessTime          : num  6.02 11.87 3.59 9.49 4.22 ...
 $ Hoursnofeed          : Factor w/ 276 levels "","0","0.25",..: 32 19 23 33 31 8 66 47 64 46 ...
 $ KillLine             : Factor w/ 245 levels "","1","1:00:00",..: 49 49 49 49 49 49 49 49 152 35 ...
 $ Plant                : Factor w/ 61 levels "","0.334027778",..: 61 61 61 61 61 61 61 61 58 61 ...
 $ ProcessDay           : Factor w/ 304 levels "","-0.025","-0.615972222",..: 19 19 19 19 19 19 19 19 251 130 ...
 $ BirdType2            : Factor w/ 12 levels "","1","2","3",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ MortPlaced           : Factor w/ 19 levels "","1","2","20370",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MortDOA              : Factor w/ 19 levels "","130","21",..: 19 18 15 17 18 13 15 19 13 14 ...
 $ MortWeek1            : Factor w/ 27 levels "","110","113",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ MortWeek2            : Factor w/ 82 levels "","&","0","1",..: 82 82 82 82 82 82 82 82 82 82 ...
 $ MortLastWeek         : Factor w/ 84 levels "","&","0","10",..: 79 79 79 79 79 79 79 79 79 79 ...
 $ MortTotal            : Factor w/ 82 levels "","&","0","103",..: 78 78 78 78 78 78 78 78 78 78 ...
 $ NDVTiters            : Factor w/ 67 levels "","0","0.25",..: 59 59 59 59 59 59 59 59 59 59 ...
 $ IBVTiters            : Factor w/ 74 levels "","0","0.125",..: 72 73 72 72 73 73 73 73 72 72 ...
 $ FANS                 : Factor w/ 63 levels "","0","0.25",..: 30 30 2 2 2 2 2 2 2 2 ...
 $ LGHTS                : Factor w/ 20 levels "","&","0","0.142857143",..: 8 8 1 1 1 1 1 1 1 1 ...
 $ FUEL                 : Factor w/ 17 levels "","&","0","0.25",..: 3 3 1 1 1 1 1 1 1 1 ...
 $ SalmonRehang         : Factor w/ 14 levels "","&","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SalmonBootSwabsno    : Factor w/ 10 levels "","0","2","diesel",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SalmonBootSwabsPos   : Factor w/ 9 levels "","0","0.5","2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SalmonCarcassRinsePos: Factor w/ 7 levels "","&","0","0.25",..: 1 1 1 1 1 1 1 1 1 1 ...

Daniel Fulop

May 8, 2015, 1:30:09 PM
to davi...@googlegroups.com
Generalized pairs plots should allow you to visually assess correlations among categorical and continuous data. For example, for categorical:nominal you could use boxplots to see if there's a relationship between factor levels and a continuous var. (i.e. to see if the boxplots' medians and distributions are similar or quite different), etc.  Read this paper: http://vita.had.co.nz/papers/gpp.pdf  ...and play around with plotting your data in different ways to explore patterns.
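
As a small sketch of the boxplot idea in base R (the toy data and the names weight and type are invented for illustration):

```r
# Toy data: one continuous variable and one nominal factor (hypothetical names).
df <- data.frame(
  weight = c(25.1, 24.8, 26.0, 25.5, 24.9, 26.2),
  type   = factor(c("conv", "conv", "organic", "organic", "conv", "organic"))
)

# One box per factor level; clearly shifted boxes suggest a relationship.
boxplot(weight ~ type, data = df, ylab = "weight")

# The same comparison numerically: median of the continuous variable per level.
group_medians <- tapply(df$weight, df$type, median)
group_medians
```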

Daniel Fulop

May 8, 2015, 1:32:29 PM
to davi...@googlegroups.com
This is the link to the final peer-reviewed version of the paper: http://www.tandfonline.com/doi/abs/10.1080/10618600.2012.694762

Laura Patterson

May 16, 2015, 3:52:36 PM
to davi...@googlegroups.com
Hi
Others in my class have used corrplot successfully on our data set, which contains categorical and continuous variables. However, I keep getting stuck with error messages.
Am I missing code? If I have 46 variables, do I need to insert those?
Thanks

Warning messages:
1: In Ops.factor(d[, 1L], w) : ‘*’ not meaningful for factors
2: In Ops.factor(d[, 1L], d[, 2L]) : ‘*’ not meaningful for factors
3: In Ops.factor(d[, 1L], 2) : ‘^’ not meaningful for factors

My R code:
install.packages("corrplot") 
Library(corrplot) 
COR<-corr(Broiler) 
corrplot(M, method="circle")

Laura Patterson

May 16, 2015, 4:28:16 PM
to davi...@googlegroups.com
I found another line of code but now the next line isn't working and says I need a data frame.

My R code for corrplot:
test <- matrix(data=c(1:53),nrow=464,ncol=53)   <NOTE: if my .csv file has 464 rows and 53 columns, is this what I said in the code?>
COR<-corr(test) 
corrplot(COR, method="circle")

Error message:
Error in corrplot(COR, method = "circle") : Need a matrix or data frame!

Thanks!

Michael Levy

May 16, 2015, 4:38:57 PM
to davi...@googlegroups.com

Hi Laura,

The first argument to corrplot is a correlation matrix, not the raw dataframe. If all the columns in your dataframe were numeric, you could use cor(thedataframe) (not sure what the function corr() with two r’s is) to generate the correlation matrix and then pass that along to corrplot(). However, as others have pointed out, calculating correlations for categorical variables is a bit more nuanced.
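
A minimal sketch of that two-step workflow, using a few numeric columns of the built-in mtcars data as a stand-in for the real data set. The plotting call is guarded in case corrplot is not installed:

```r
# Step 1: build the correlation matrix from numeric columns only.
M <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Step 2: pass the correlation matrix (not the raw data frame) to corrplot().
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(M, method = "circle")
}
```

Note that the object created in step 1 has to be the same one passed in step 2, and that the function names are library() and cor(), both lowercase.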


-- 
Michael Levy
c: 304-376-4523

