reshape data to generate heatmap using ggplot2

Mahbubul Majumder

Jun 27, 2011, 2:46:00 PM
to ggplot2, stat...@iastate.edu
Hi,

I am having trouble reshaping data in R. Example input and the desired
output are shown below.

in.dat <- data.frame(ID = c(1,1,1,2,2,3,3,3),
                     code = c("A","B","D","C","A","B","A","A"))

I need to measure some sort of correlation between the codes. For this
I need, for each pair of codes, the number of IDs in which the two
codes occur together. For example, code "B" occurs with code "A" in
2 IDs (ID = 1 and 3). The output should therefore look like the
out.dat generated by the following code:

A <- c(4,2,1,1)
B <- c(2,2,0,1)
C <- c(1,0,1,1)
D <- c(1,1,1,1)
cds <- c("A","B","C","D")
out.dat <- data.frame(cds,A,B,C,D)
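
To make the counting rule concrete, the B-with-A count from the example can be checked directly. This is a minimal base R sketch; the names ids.with.A, ids.with.B, shared and n.shared are only for illustration and are not part of the original post.

```r
# Example input from the post above.
in.dat <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3),
                     code = c("A", "B", "D", "C", "A", "B", "A", "A"))

# IDs in which each code appears at least once.
ids.with.A <- unique(in.dat$ID[in.dat$code == "A"])
ids.with.B <- unique(in.dat$ID[in.dat$code == "B"])

# IDs that contain both codes, and how many there are.
shared <- intersect(ids.with.A, ids.with.B)
n.shared <- length(shared)
print(shared)
print(n.shared)
```

Repeating this for every pair of codes gives one cell of the desired matrix per pair.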

My final goal is to generate a heatmap as follows.

library(reshape2)  # provides melt()
library(ggplot2)   # provides qplot()

plot.dat <- melt(out.dat, id="cds")
qplot(cds, variable, geom="tile", fill=value, data=plot.dat,
      xlab="Codes", ylab="Codes")

I would appreciate it if anyone could help me.

Ista Zahn

Jun 27, 2011, 5:40:39 PM
to Mahbubul Majumder, ggplot2, stat...@iastate.edu
Hi,
It seems like this should be easy, but it was actually a bit tricky
for me. Here is what I came up with:

library(plyr)  # for ddply() and .()

co.occurance.count <- function(DF) {
  code <- as.character(DF$code)
  # self-pairs for each unique code, plus every pair from combn()
  R <- as.data.frame(t(cbind(rbind(unique(code), unique(code)),
                             combn(code, 2))))
  names(R) <- c("code1", "code2")
  return(R)
}
tmp <- ddply(in.dat, .(ID), co.occurance.count)
out.dat <- as.data.frame(table(tmp[-1]))

Ugly but effective...
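
To see why the counts from this function are not symmetric, it may help to run its core expression on the codes of a single ID. The following is a small base R sketch (no plyr needed) using the codes of ID 3 from the example; pairs and tab are illustrative names.

```r
# Codes recorded for ID 3 in the example data, in their original order.
code <- c("B", "A", "A")

# Same expression as inside co.occurance.count(): one self-pair per
# unique code, followed by every pair that combn() produces.
pairs <- t(cbind(rbind(unique(code), unique(code)), combn(code, 2)))
colnames(pairs) <- c("code1", "code2")
print(pairs)

# Cross-tabulate the pairs for this one ID.
tab <- table(code1 = pairs[, "code1"], code2 = pairs[, "code2"])
print(tab)
```

Because combn() keeps the order in which the codes appear, the pair (B, A) is recorded here but (A, B) never is, so the two off-diagonal cells of the table differ.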

HTH,
Ista

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

Jonathan Kennel

Jun 28, 2011, 12:20:57 AM
to Ista Zahn, Mahbubul Majumder, ggplot2, stat...@iastate.edu
Hello Mahbubul,

It's not perfectly clear from your example how you are counting your combinations (check your A, B, C, D vectors in out.dat), but here is some code that shows a few different ways of doing the counts. I think you want the 3rd option, but I am not totally sure, as your example has a count of 4 for A vs A, which matches options 2a, 2b and 2c. This should be a fast and simple way of doing it. The following code does the counts in 5 different ways, and Ista's gives a different, 6th way.

# CREATE TABLES OF COUNTS
A      <- table(in.dat)
Ared <- ifelse(A > 0, 1, 0)     # ANY COUNT GREATER THAN 0 IS SET TO 1 IN THIS TABLE

# 1) MAXIMUM OF EACH (INCLUDES REPLACEMENT)
out.dat <- crossprod(A,A)

# 2a,b, c) COMBINATION COUNT OF EACH VALUE (ASYMMETRIC VERSION WHERE IF YOU HAVE TWO A's AND ONE B
#       IN AN ID FACTOR IT WOULD COUNT AS TWO FOR A VS B, BUT 1 FOR B VS A)
out.dat <- crossprod(A,Ared)  # OR 
out.dat <- crossprod(Ared,A)  # OR MAX OF THE TWO WHICH WILL BE SYMMETRIC
out.dat <- ifelse(crossprod(A,Ared)>crossprod(Ared,A),crossprod(A,Ared),crossprod(Ared,A))

# 3) NUMBER OF UNIQUE COUNTS (ie IF YOU HAVE TWO A's AND ONE B IN AN ID FACTOR IT WOULD 
#     COUNT AS ONE FOR A AND B)
out.dat <- crossprod(Ared,Ared)

# BASE VERSION PLOT
image(out.dat)

#GGPLOT2 VERSION
plot.dat <- melt(data.frame(cds = colnames(out.dat), values = out.dat))
qplot(cds,variable, geom="tile", fill=value,data=plot.dat, xlab="Codes", ylab="Codes")

There is likely a smoother way to plot matrices in ggplot than converting to a dataframe but I do not know it.
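
One base R route, offered as a sketch and not necessarily the smoothest one: as.data.frame(as.table(m)) turns a matrix with dimnames into a long data frame with one row per cell. The dimension names code1 and code2 are assumptions added here so that the two columns get distinct names, and unclass() is used only to work with a plain matrix. The plot object is built only if ggplot2 is installed.

```r
in.dat <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3),
                     code = c("A", "B", "D", "C", "A", "B", "A", "A"))

A <- unclass(table(in.dat))        # plain integer matrix, IDs x codes
Ared <- ifelse(A > 0, 1, 0)        # 1 if the code occurs in the ID at all
out.dat <- crossprod(Ared, Ared)   # option 3: number of IDs shared by each pair

# Name the two dimensions so the long data frame gets distinct columns.
names(dimnames(out.dat)) <- c("code1", "code2")
plot.dat <- as.data.frame(as.table(out.dat))  # columns: code1, code2, Freq
print(head(plot.dat))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot.dat, aes(code1, code2, fill = Freq)) +
    geom_tile() + xlab("Codes") + ylab("Codes")
}
```

Printing p in an interactive session draws the heatmap.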

HTH,
Jonathan

Mahbubul Majumder

Jun 28, 2011, 10:14:50 AM
to Jonathan Kennel, Ista Zahn, ggplot2
Jonathan,

Many thanks for the solutions you have provided. You were right, there
was a flaw in my example. I was actually looking for the following
solution you gave:

A <- table(in.dat)
Ared <- ifelse(A > 0, 1, 0)

out.dat <- ifelse(crossprod(A,Ared)>crossprod(Ared,A),crossprod(A,Ared),crossprod(Ared,A))

I would like to thank Ista for the asymmetric version of the solution.
It appears that his solution is faster for a large data set (at least
1 million records). Can we make it symmetric like the one Jonathan did?

Thanks again for your time and help.

--
Mahbub Majumder
Graduate Student
Dept. of Statistics
Iowa State University

Hadley Wickham

Jun 28, 2011, 10:37:30 AM
to Mahbubul Majumder, Jonathan Kennel, Ista Zahn, ggplot2
On Tue, Jun 28, 2011 at 9:14 AM, Mahbubul Majumder <mahb...@gmail.com> wrote:
> Jonathan,
>
> Many thanks for the solutions you have provided. You were right, I had
> some flaw in my example. I was actually looking for the following
> solution you gave
>
>   A <- table(in.dat)
>   Ared <- ifelse(A > 0, 1, 0)
>   out.dat <- ifelse(crossprod(A,Ared)>crossprod(Ared,A),crossprod(A,Ared),crossprod(Ared,A))

I think you can write this a little more simply (and efficiently) as:

pmax(crossprod(A, Ared), crossprod(Ared, A))
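
As a quick check on the thread's example data, the two forms can be compared directly. This is a sketch: via.ifelse and via.pmax are illustrative names, and unclass() is added only so that both results are plain matrices.

```r
in.dat <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3),
                     code = c("A", "B", "D", "C", "A", "B", "A", "A"))

A <- unclass(table(in.dat))   # counts of each code within each ID
Ared <- ifelse(A > 0, 1, 0)   # presence/absence version of A

# Element-wise maximum of the two asymmetric cross-products, written both ways.
via.ifelse <- ifelse(crossprod(A, Ared) > crossprod(Ared, A),
                     crossprod(A, Ared), crossprod(Ared, A))
via.pmax <- pmax(crossprod(A, Ared), crossprod(Ared, A))

same.values <- all(via.ifelse == via.pmax)      # do the two forms agree?
is.symmetric <- all(via.pmax == t(via.pmax))    # is the result symmetric?
print(via.pmax)
print(c(same.values = same.values, is.symmetric = is.symmetric))
```

The pmax() form evaluates each cross-product once, whereas the ifelse() form as written evaluates each of them twice.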

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Jonathan Kennel

Jun 28, 2011, 10:43:14 AM
to Hadley Wickham, Mahbubul Majumder, Ista Zahn, ggplot2
Hadley - Beautiful!

Mahbubul - You may also be interested in:
 
pmin(crossprod(A, Ared), crossprod(Ared, A))

-Jonathan

Mahbubul Majumder

Jun 28, 2011, 5:00:19 PM
to Jonathan Kennel, Hadley Wickham, Ista Zahn, ggplot2
Wonderful! I could plot part of the data based on the solution. For my
complete data, the "table" command generates a huge matrix (since I
have about 6 million IDs and 1000 codes). On the other hand, the
solution provided by Ista does not use the "table" command but directly
provides the melted data frame. I just need to make it symmetric. Is
that possible?

I appreciate your help.

Thanks.
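
One possible way to get a symmetric result in long form, without first building the full ID-by-code table, is to drop duplicate (ID, code) rows, join the data to itself on ID, and count the resulting code pairs. This is a base R sketch and not Ista's method. It counts each ID at most once per pair (Jonathan's option 3), not the pmax variant, and the names pairs, both, tab and long are illustrative. The self-join can itself become large when IDs carry many codes, and it has not been tried at the scale described above.

```r
in.dat <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3),
                     code = c("A", "B", "D", "C", "A", "B", "A", "A"),
                     stringsAsFactors = FALSE)

# One row per distinct (ID, code) combination.
pairs <- unique(in.dat)

# Self-join on ID: every ordered pair of codes that share an ID,
# including each code paired with itself. Columns: ID, code.x, code.y.
both <- merge(pairs, pairs, by = "ID")

# Count the pairs and convert to a long (melted) data frame.
tab <- table(code1 = both$code.x, code2 = both$code.y)
long <- as.data.frame(tab)   # columns: code1, code2, Freq
print(long)
```

Because the join produces both (A, B) and (B, A) for every ID that contains the two codes, the counts are symmetric by construction.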

--

Mahbubul Majumder

Jun 29, 2011, 8:57:48 AM
to Jonathan Kennel, ggplot2
Jonathan,

Thanks for the pmin() function. I used it as follows:

Ared <- pmin(A,1)

Is it faster than ifelse(A > 0, 1, 0)?
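
For count data (non-negative integers) the two expressions give the same values, so the choice comes down to speed and readability. One way to compare them is system.time() on a matrix of made-up counts. In this sketch the matrix size, the Poisson counts and the number of repetitions are all assumptions for illustration, and the timings will vary by machine.

```r
set.seed(1)

# Hypothetical 1000 x 1000 matrix of small non-negative counts.
counts <- matrix(rpois(1e6, lambda = 0.5), ncol = 1000)

via.ifelse <- ifelse(counts > 0, 1, 0)
via.pmin <- pmin(counts, 1)

# Both approaches should cap every positive count at 1.
same.values <- all(via.ifelse == via.pmin)
print(same.values)

# Time ten repetitions of each approach.
time.ifelse <- system.time(for (i in 1:10) ifelse(counts > 0, 1, 0))
time.pmin <- system.time(for (i in 1:10) pmin(counts, 1))
print(time.ifelse)
print(time.pmin)
```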

--

Jonathan Kennel

Jul 2, 2011, 1:12:45 AM
to Mahbubul Majumder, Hadley Wickham, Ista Zahn, ggplot2
Mahbubul,

This does not really have anything to do with ggplot2, but it may work for you and may be of interest to some on the list. It takes about 2 min on my laptop, which is about a min too long. The key is sparse storage, since you have so many IDs.

library(Matrix)  # sparse storage

# INPUT DATA
in.dat <- data.frame(ID = sample(1:6000000, 13000000, replace = TRUE),
                     code = sample(1:1000, 13000000, replace = TRUE))

# USE SPARSE MATRIX STORAGE
A <- xtabs(~ID+code, data = in.dat, sparse = TRUE)
Ared <- A
Ared@x  <- rep(1, length(Ared@x))
out.dat <- pmax(as.matrix(crossprod(A, Ared)),
                as.matrix(crossprod(Ared, A)))
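
Before running this on the full-size data, the same steps can be tried on the small example from the start of the thread and compared with the dense result. This is a sketch that assumes the Matrix package is installed; sparse.out, dense.out, D and Dred are illustrative names.

```r
library(Matrix)  # sparse storage

in.dat <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3),
                     code = c("A", "B", "D", "C", "A", "B", "A", "A"))

# Sparse route, same steps as above.
A <- xtabs(~ ID + code, data = in.dat, sparse = TRUE)
Ared <- A
Ared@x <- rep(1, length(Ared@x))   # set every stored (non-zero) count to 1
sparse.out <- pmax(as.matrix(crossprod(A, Ared)),
                   as.matrix(crossprod(Ared, A)))

# Dense route from earlier in the thread, for comparison.
D <- unclass(table(in.dat))
Dred <- ifelse(D > 0, 1, 0)
dense.out <- pmax(crossprod(D, Dred), crossprod(Dred, D))

routes.agree <- all(sparse.out == dense.out)
print(sparse.out)
print(routes.agree)
```

Only the final 4 x 4 code-by-code matrix is converted to dense form here; the ID-by-code table stays sparse throughout.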

Cheers,
-Jonathan