How to calculate a matrix of pairwise counts from a 'long' data frame


Iain Dillingham

Nov 1, 2012, 12:16:10 PM
to manip...@googlegroups.com
Hello everyone,

I have a 'long' data frame with id and featureCode columns. The featureCode column contains values of a categorical variable; each record has between 1 and 9 of these. For example:

id  featureCode
5   PPLC
5   PCLI
6   PPLC
6   PCLI
7   PPL
7   PPLC
7   PCLI
8   PPLC
9   PPLC
10  PPLC

I'd like to calculate the number of times each feature code is used with the other feature codes (the "pairwise counts" of the title). Ultimately, the result would be a matrix. For example:

      PPLC  PCLI  PPL
PPLC  0     3     1
PCLI  3     0     1
PPL   1     1     0

However, I suspect that to get this far, I need to use plyr (or similar) to produce an intermediate data frame of the form:

id  featureCode1  featureCode2
5   PPLC          PCLI
5   PCLI          PPLC

I scoured the web (and the ggplot2 book, which has a section on plyr) for help and came up with the following:

my_func <- function(df)
{
  with(df, data.frame(
    for (i in 1:length(featureCode))
    {
      for (j in 1:length(featureCode))
      {
        if (i != j)
        {
          featureCode1 = featureCode[i]
          featureCode2 = featureCode[j]
        }
      }
    }
  ))
}

reports.pairs <- ddply(reports.long, .(id), my_func)

Not surprisingly, it doesn't work (I'm an R beginner and come from a Java background). However, I include it here to give you an idea of what I'm trying to do.

Could anyone suggest where I might be going wrong? Thanks in advance for any help.
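
A likely reason the function returns nothing useful: in R a `for` loop evaluates to NULL, so `data.frame()` receives no columns, and the assignments inside the loop are simply discarded. The following is a minimal base R sketch of the intermediate table described above, not the code anyone in this thread posted. `pairs_for_id` is a hypothetical helper, and `split()`/`lapply()` stand in for `ddply()` so the example needs no packages.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id featureCode
5  PPLC
5  PCLI
6  PPLC
6  PCLI
7  PPL
7  PPLC
7  PCLI
8  PPLC
9  PPLC
10 PPLC")

# Hypothetical helper: every ordered pair of distinct codes within one id.
pairs_for_id <- function(d) {
  grid <- expand.grid(featureCode1 = d$featureCode,
                      featureCode2 = d$featureCode,
                      stringsAsFactors = FALSE)
  # Drop a code paired with itself
  grid <- grid[grid$featureCode1 != grid$featureCode2, ]
  # ids with a single code contribute zero rows
  data.frame(id = rep(d$id[1], nrow(grid)), grid)
}

# Apply the helper per id and stack the pieces
pairs <- do.call(rbind, lapply(split(dat, dat$id), pairs_for_id))
rownames(pairs) <- NULL
```

Each unordered pair appears twice here (once in each direction), which is what a symmetric count matrix needs.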

Iain

Winston Chang

Nov 1, 2012, 12:38:35 PM
to Iain Dillingham, manipulatr
You can convert it to wide format like this:

# Read the example data inline; text= avoids opening and closing a connection
dat <- read.table(header=TRUE, text='
  id  featureCode
  5   PPLC
  5   PCLI
  6   PPLC
  6   PCLI
  7   PPL
  7   PPLC
  7   PCLI
  8   PPLC
  9   PPLC
  10  PPLC')


# Convert to wide format
library(reshape2)
dat_wide <- dcast(dat, id ~ featureCode, value.var="featureCode")
#   id PCLI  PPL PPLC
# 1  5 PCLI <NA> PPLC
# 2  6 PCLI <NA> PPLC
# 3  7 PCLI  PPL PPLC
# 4  8 <NA> <NA> PPLC
# 5  9 <NA> <NA> PPLC
# 6 10 <NA> <NA> PPLC


After this stage, I'm not sure of the best way to count up the pairings. I can think of some not-very-elegant ways to do it, but maybe someone else will have better ideas.
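
One possible way to count the pairings, offered as a sketch rather than something proposed in this thread: build an id-by-featureCode incidence table straight from the long data and take its cross-product. It uses base R only; `tab` and `m` are illustrative names, and `unique()` is there on the assumption that an id should count a code at most once.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id featureCode
5  PPLC
5  PCLI
6  PPLC
6  PCLI
7  PPL
7  PPLC
7  PCLI
8  PPLC
9  PPLC
10 PPLC")

# Rows are ids, columns are feature codes, cells are 0 or 1
tab <- table(unique(dat[c("id", "featureCode")]))

# t(tab) %*% tab: cell [a, b] counts the ids that carry both a and b
m <- crossprod(tab)

# The diagonal holds how many ids carry each code; zero it for pair counts
diag(m) <- 0
```

The rows and columns come out in alphabetical order (PCLI, PPL, PPLC), so they may need reordering to match a particular layout.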

-Winston

Peter Meilstrup

Nov 1, 2012, 1:02:04 PM
to Iain Dillingham, manip...@googlegroups.com
Er, not sure how that got sent. Try merging the data frame with itself:

j <- merge(df, df, by="id")

then count cases:

c <- count(j, c("plcc.x", "plcc.y"))

then convert to matrix:

acast(c, plcc.x ~ plcc.y)
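
For reference, here is a sketch of the same merge-and-count idea run against the example data from the first post. Two assumptions: `plcc.x` and `plcc.y` above look like placeholder names, since merging this data on `id` produces columns `featureCode.x` and `featureCode.y`; and base R's `table()` is swapped in for `count()` plus `acast()` so the example runs without plyr or reshape2.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id featureCode
5  PPLC
5  PCLI
6  PPLC
6  PCLI
7  PPL
7  PPLC
7  PCLI
8  PPLC
9  PPLC
10 PPLC")

# Self-join on id: one row per ordered pair of codes within each id,
# including each code paired with itself
j <- merge(dat, dat, by = "id")

# Cross-tabulate the two code columns to get the count matrix
pair_counts <- table(j$featureCode.x, j$featureCode.y)
```

Because the self-join pairs every code with itself, the diagonal of `pair_counts` is not zero: it holds the number of ids carrying each code.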

On Thu, Nov 1, 2012 at 9:55 AM, Peter Meilstrup
<peter.m...@gmail.com> wrote:
> Do a join of the dat

Iain Dillingham

Nov 2, 2012, 8:46:00 AM
to manip...@googlegroups.com
Thanks for your help. Unfortunately, the merge isn't quite what I'm looking for, as it double-counts categories. However, if you're interested, I also posted the question on Stack Overflow and received some useful advice.

Iain

Peter Meilstrup

Nov 2, 2012, 7:44:54 PM
to Iain Dillingham, manip...@googlegroups.com
Ah, I see. So, if you don't want to count a feature appearing with
itself, you finish by setting the diagonal to zero:

m <- acast(c, plcc.x ~ plcc.y)
diag(m) <- 0

That seems to reproduce your example data?
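
A quick way to check that claim, sketched in base R under the assumption that the merged columns are named `featureCode.x` and `featureCode.y` (what `merge()` produces for this data) and with `table()` standing in for `count()` and `acast()`. `codes` and `expected` are illustrative names; `expected` is typed in from the matrix in the first post.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id featureCode
5  PPLC
5  PCLI
6  PPLC
6  PCLI
7  PPL
7  PPLC
7  PCLI
8  PPLC
9  PPLC
10 PPLC")

# Self-join, cross-tabulate, then zero the self-pairs on the diagonal
j <- merge(dat, dat, by = "id")
m <- unclass(table(j$featureCode.x, j$featureCode.y))
diag(m) <- 0

# Reorder to match the layout used in the original question
codes <- c("PPLC", "PCLI", "PPL")
m <- m[codes, codes]

# The target matrix from the first post
expected <- matrix(c(0, 3, 1,
                     3, 0, 1,
                     1, 1, 0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(codes, codes))

matches <- all(m == expected)
```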