Transporting mclust() plot elements into ggplot

1,550 views
Skip to first unread message

FishGradGuy

unread,
Jan 27, 2011, 12:03:03 AM1/27/11
to ggplot2
All,

I'm using mclust() to perform cluster analysis on some data. The plots
that are automatically generated by mclust contain cluster information
both as shape and color, which would be easy enough to pull into
ggplot. Several of the default plots also contain cluster centers that
are plotted as ellipses determined by the variance of the clusters.
I've done the rest of the plots for my thesis in ggplot and would like
to be able to transport these plots into ggplot as well to keep things
the same.

I am wondering if anyone is familiar with how mclust() plots are
created and what elements I need to pull out to create the cluster
center ellipses in ggplot. I've been trying to figure it out but I'm
not getting it.

Thanks in advance

Brandon Hurr

unread,
Jan 31, 2011, 12:02:01 PM1/31/11
to FishGradGuy, ggplot2
FGG, 

Do you have some example data we could play with? I once tried to do something similar to make a dendrogram but didn't get very far. Perhaps we could pool our efforts? 

From what I recall the structure of the cluster that the function produced wasn't that easy to decipher. It had distances where the splits were, but then you couldn't tell what was splitting from what at that point (at least I couldn't). 

B


--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

Jens Hegg

unread,
Feb 14, 2011, 8:16:01 PM2/14/11
to Brandon Hurr, ggplot2
Brandon,

Sorry to leave this for so long without a reply. Here is an example of what I'm wanting to do.

> iris_mclust=Mclust(iris[,1:4])
> plot(iris_mclust, data=iris[,1:4])

In the bivariate plot of Sepal.Length and Sepal.Width that comes up third (after the BIC plot and pairs plot) the cluster centers are demarcated with an X and there is a variance ellipse around it which corresponds to the cluster direction and shape. It isn't hard to plot the groupings using  iris_mclust$classification for point color or shape. I want to make a plot in ggplot that includes the cluster centers and variance ellipses as well, but I can't figure out how those are plotted.

If you, or anyone else, has input on how those ellipses are plotted and how to use them in ggplot let me know.

Thanks,

Jens 

Brandon Hurr

unread,
Feb 15, 2011, 2:31:18 AM2/15/11
to Jens Hegg, ggplot2
Cluster centers appear to be under iris_mclust$parameters$mean

$mean
                  [,1]     [,2]
Sepal.Length 5.0060021 6.261996
Sepal.Width  3.4280046 2.871999
Petal.Length 1.4620006 4.905993
Petal.Width  0.2459998 1.675997

cluster 1 is centered at 5 x 3.43 and 2 @ 6.26 x 2.87

Should be able to plot these on the same plot as your colored points using geom_point() with a special shape.  

I'm guessing, but the elipses must be under iris_mclust$parameters$variance... 

Should be able to plot these using grobs. Just last week or so B. Bertlesman asked about plotting shapes. The issue as usual is figuring out what ggplot needs to plot these shapes and how to use the information in mclust to get these values. 

Brandon

Charlotte Wickham

unread,
Feb 15, 2011, 2:18:06 PM2/15/11
to Brandon Hurr, Jens Hegg, ggplot2
Here's an attempt:
library(ellipse)
library(mclust)
library(ggplot2)

iris_mclust=Mclust(iris[,1:4])

get.ellipses <- function(coords, mclust.fit){
centers <- mclust.fit$parameters$mean[coords, ]
vars <- mclust.fit$parameters$variance$sigma[coords, coords, ]
ldply(1:ncol(centers), function(cluster){
data.frame(ellipse(vars[,,cluster], centre = centers[, cluster],
level = 0.5), classification = cluster)
})
}

iris.el <- get.ellipses(c("Sepal.Length", "Sepal.Width"), iris_mclust)
iris$classification <- iris_mclust$classification

ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = factor(classification))) +
geom_point(aes(shape = classification))+
geom_path(data = iris.el,
aes(group = classification, linetype = classification))

The size of the ellipses is the tricky part, from what I can tell
mclust uses some classification probability level. I just used the 50%
quantile for an equivalent multivariate normal (the level parameter in
ellipse).

Hope it helps,
Charlotte

Brandon Hurr

unread,
Feb 15, 2011, 2:32:25 PM2/15/11
to Charlotte Wickham, Jens Hegg, ggplot2
Charlotte, 

Awesome. 

I've added what you have to what I have... 

require(mclust)

require(ggplot2)

require(ellipse)


#if function, input is original dataset

orig.data<-iris[1:4]


data.mclust<-Mclust(orig.data)


BIC.data<-as.data.frame(data.mclust$BIC)

BIC.data$NumComp<-rownames(BIC.data)

melted.BIC<-melt(BIC.data, var.ids= "NumComp")


ggplot(melted.BIC, aes(x=as.numeric(NumComp), y=value, colour=variable, group=variable))+

scale_x_continuous("Number of Components")+

scale_y_continuous("BIC")+

scale_colour_hue("")+

geom_point()+

geom_line()+

theme_bw()



#plotting correlation matrix of original data colored by model classification

#stripped from plotmatrix()

mapping=aes(colour=as.factor(data.mclust$classification), shape=as.factor(data.mclust$classification))


grid <- expand.grid(x = 1:ncol(orig.data), y = 1:ncol(orig.data))

grid <- subset(grid, x != y)

all <- do.call("rbind", lapply(1:nrow(grid), function(i) {

  xcol <- grid[i, "x"]

  ycol <- grid[i, "y"]

  data.frame(xvar = names(orig.data)[ycol], yvar = names(orig.data)[xcol], 

  x = orig.data[, xcol], y = orig.data[, ycol], orig.data)

  }))

all$xvar <- factor(all$xvar, levels = names(orig.data))

all$yvar <- factor(all$yvar, levels = names(orig.data))

densities <- do.call("rbind", lapply(1:ncol(orig.data), function(i) {

data.frame(xvar = names(orig.data)[i], yvar = names(orig.data)[i], 

x = orig.data[, i])

}))

mapping <- defaults(mapping, aes_string(x = "x", y = "y"))

class(mapping) <- "uneval"

ggplot(all, mapping) + 

facet_grid(xvar ~ yvar, scales = "free") + 

geom_point(na.rm = TRUE) + 

stat_density(aes(x = x, y = ..scaled.. * diff(range(x)) + min(x)), data = densities, position = "identity", colour = "grey20", geom = "line") +

opts(legend.postion= "none")


#cant get rid of the damn legend

#it would be a bunch faster if it only plotted 1 v 2 and not also 2 v 1... cest la vie



#scatter plot of width by length

#color and shape by classification



#center of cluster by data.mclust$parameters$mean


get.ellipses <- function(coords, mclust.fit){

centers <- mclust.fit$parameters$mean[coords, ]

vars <- mclust.fit$parameters$variance$sigma[coords, coords, ]

ldply(1:ncol(centers), function(cluster){

  data.frame(ellipse(vars[,,cluster], centre = centers[, cluster],

level = 0.5), classification = cluster)

  })

}


iris.el <- get.ellipses(c("Sepal.Length", "Sepal.Width"), data.mclust)

orig.data$classification <- data.mclust$classification


ggplot(orig.data, aes(Sepal.Length, Sepal.Width, colour = factor(classification))) +

geom_point(aes(shape = classification))+

geom_path(data = iris.el,

aes(group = classification, linetype = classification))


Getting there... 


Brandon

Dennis Murphy

unread,
Feb 15, 2011, 10:44:11 PM2/15/11
to Brandon Hurr, Charlotte Wickham, Jens Hegg, ggplot2
Hi:

How about this?

library(mclust)
library(ggplot2)

# iris_mclust=Mclust(iris[,1:4])

# Charlotte Wickham's function to generate ellipse coordinates


get.ellipses <- function(coords, mclust.fit){
 centers <- mclust.fit$parameters$mean[coords, ]
 vars <- mclust.fit$parameters$variance$sigma[coords, coords, ]
 ldply(1:ncol(centers), function(cluster){
   data.frame(ellipse(vars[,,cluster], centre = centers[, cluster],
     level = 0.5), classification = cluster)
   })
}

# Function to generate data frame for producing ellipses (DM)

df.ellipses <- function(mclustobj){
   # Takes an mclust object as input.
   # Outputs a data frame that contains information about ellipse
   # coordinates for all pairs of variables in the mclust object.
    require(ellipse)
    require(plyr)
    nms <- rownames(mclustobj$parameters$mean)
    n <- length(nms)
    grid <- expand.grid(x = 1:n, y = 1:n)

    grid <- subset(grid, x != y)
    grid <- cbind(grid[, 2], grid[, 1])
    ldply(1:nrow(grid), function(i){
        coords <- as.numeric(grid[i, ])
        ell <- get.ellipses(c(nms[coords]), mclustobj)
        names(ell) <- c('yval', 'xval', 'classification')
        data.frame(xvar = nms[coords[1]], yvar = nms[coords[2]], ell)
   })
}

# Brandon's plot code to graph BIC values by columns of the BIC component of
# the mclust object

orig.data<-iris[1:4]   


data.mclust<-Mclust(orig.data)           # same as iris_mclust

BIC.data<-as.data.frame(data.mclust$BIC)
BIC.data$NumComp<-rownames(BIC.data)
melted.BIC<-melt(BIC.data, var.ids= "NumComp")


ggplot(melted.BIC, aes(x=as.numeric(NumComp), y=value, colour=variable, group=variable))+
   labs(x = 'Number of components', y = 'BIC', colour = '') +
# scale_x_continuous("Number of Components")+
# scale_y_continuous("BIC")+
# scale_colour_hue("")+
geom_point()+
geom_line()+
theme_bw()

# Brandon's code, slightly modified by DM, to produce a scatterplot matrix
# that includes confidence ellipses for the classifications and groupwise density
# plots along the diagonal.


mapping=aes(colour=as.factor(data.mclust$classification),
            shape=as.factor(data.mclust$classification))


grid <- expand.grid(x = 1:ncol(orig.data), y = 1:ncol(orig.data))
grid <- subset(grid, x != y)

# Data frame for plotting points in each panel
all <- ldply(1:nrow(grid), function(i) {

  xcol <- grid[i, "x"]
  ycol <- grid[i, "y"]
  data.frame(xvar = names(orig.data)[ycol], yvar = names(orig.data)[xcol],
  x = orig.data[, xcol], y = orig.data[, ycol], orig.data)
  })
all$xvar <- factor(all$xvar, levels = names(orig.data))
all$yvar <- factor(all$yvar, levels = names(orig.data))

# Data frame for plotting density estimates
densities <- ldply(1:ncol(orig.data), function(i)

       data.frame(xvar = names(orig.data)[i], yvar = names(orig.data)[i],
                  x = orig.data[, i],
                  classification = as.factor(data.mclust$classification)))

d <- ldply(1:ncol(orig.data), function(i)
       data.frame(xvar = names(orig.data)[i], x = orig.data[, i])


mapping <- defaults(mapping, aes_string(x = "x", y = "y"))
class(mapping) <- "uneval"

# Data frame for ellipse info:
u <- df.ellipses(data.mclust)

g <- ggplot(all)
g + facet_grid(xvar ~ yvar, scales = "free") +
    geom_point(map = mapping, na.rm = TRUE, size = 1.5) +
    stat_density(data = densities,
                 aes(x = x, y = ..scaled.. * diff(range(x)) + min(x),
                       colour = classification),
                 position = "identity", geom = "line", size = 1) +
    geom_path(data = u, aes(x = xval, y = yval, group = classification,
                              linetype = classification), size = 1) +
    opts(legend.position= "none")

HTH,
Dennis
discscat.pdf

Brandon Hurr

unread,
Feb 16, 2011, 4:35:07 AM2/16/11
to Dennis Murphy, Charlotte Wickham, Jens Hegg, ggplot2
Dennis, 

That's very slick. I was wondering how to go about using Charlotte's code to run the ellipse code on the other pairs, but it makes perfect sense to put them in the scatterplot matrix. I'm wondering about a few things. 
  • The density plots now represent the two different clusters in the iris dataset, does this make more sense than having a representation of the entire distribution?
  • It'd be nice to eliminate the grid object either in the top right or bottom left half of the matrix, less plotting = less CPU time and you're not really getting more information with graphs 'A' vs 'B' and 'B' vs 'A', can you put empty grid objects in a pattern like that? I've seen it done with geom_tile() for a similar correlation matrix, but that's with one plot, not multiple ones like in this instance. 
  • Lastly, there's one more plot in the output of plot(Mclust()) ... "uncertainty"... which appears to be done by using scale_size and perhaps scale_alpha on geom_point() in the scatterplot. However, I'm really 'uncertain' as to where the levels for this scaling comes from. 
Brandon

Dennis Murphy

unread,
Feb 16, 2011, 5:29:48 AM2/16/11
to Brandon Hurr, ggplot2
Hi Brandon:

On Wed, Feb 16, 2011 at 1:35 AM, Brandon Hurr <brando...@gmail.com> wrote:
Dennis, 

That's very slick. I was wondering how to go about using Charlotte's code to run the ellipse code on the other pairs, but it makes perfect sense to put them in the scatterplot matrix. I'm wondering about a few things. 
  • The density plots now represent the two different clusters in the iris dataset, does this make more sense than having a representation of the entire distribution?

That's up to you; I just tried to provide an option. I just thought that if you had two groups of classifications, you might be interested in density plots for each. It's really a matter of taste, IMO...but it does explain the bimodality in the overall density plots (see example below for what I mean).
  • It'd be nice to eliminate the grid object either in the top right or bottom left half of the matrix, less plotting = less CPU time and you're not really getting more information with graphs 'A' vs 'B' and 'B' vs 'A', can you put empty grid objects in a pattern like that? I've seen it done with geom_tile() for a similar correlation matrix, but that's with one plot, not multiple ones like in this instance. 
I agree it's redundant. Lattice has panel functions that allow you to use one type of graphic in one off-diagonal and another type in the other. You managed to isolate the density plots along the diagonal panels; perhaps you can NA the values in the off-diagonal you don't want or set up the grid so that either x <= y or x >= y and line up df.ellipses accordingly. Let's try the idea out.

I redid the code to set the grid subset to be x < y in the initial grid and in df.ellipses() and got the following to work:

grid <- expand.grid(x = 1:ncol(orig.data), y = 1:ncol(orig.data))
grid <- subset(grid, x < y)


all <- ldply(1:nrow(grid), function(i) {
  xcol <- grid[i, "x"]
  ycol <- grid[i, "y"]
  data.frame(xvar = names(orig.data)[ycol], yvar = names(orig.data)[xcol],
  x = orig.data[, xcol], y = orig.data[, ycol], orig.data)
  })

all$xvar <- factor(all$xvar, levels = names(orig.data))
all$yvar <- factor(all$yvar, levels = names(orig.data))

u <- df.ellipses(data.mclust)

g <- ggplot(all)
g + facet_grid(xvar ~ yvar, scales = "free") +
    geom_point(map = mapping, na.rm = TRUE) +
    stat_density(aes(x = x, y = ..scaled.. * diff(range(x)) + min(x)),
#                     shape = as.factor(data.mclust$classification)),
                 data = densities, position = "identity", colour = "grey20", geom = "line") +
    geom_path(data = u, aes(x = xval, y = yval, group = classification,
                            linetype = classification), size = 1) +
    opts(legend.position= "none")

Perhaps one can generate a different data frame to produce another type of plot in the other off-diagonal and add it as a layer to the one above...

  • Lastly, there's one more plot in the output of plot(Mclust()) ... "uncertainty"... which appears to be done by using scale_size and perhaps scale_alpha on geom_point() in the scatterplot. However, I'm really 'uncertain' as to where the levels for this scaling comes from. 

Since I couldn't get the plot to render from data.mclust (I only got the BIC plot), I'm uncertain what you mean :)

HTH,
Dennis

Dennis Murphy

unread,
Feb 16, 2011, 5:37:07 AM2/16/11
to Brandon Hurr, ggplot2
I just noticed I forgot to include the modified df.ellipses() function - the only necessary change is, within the subset() statement, to replace x != y with x < y and then redefine the data frame I called u. A copy of the graph it should produce is included. I put this at the top of the previously posted code. Sorry for the inconvenience.

Dennis

df.ellipses <- function(mclustobj){

    require(ellipse)
    require(plyr)
    nms <- rownames(mclustobj$parameters$mean)
    n <- length(nms)
    grid <- expand.grid(x = 1:n, y = 1:n)
    grid <- subset(grid, x < y)                         # <<===

    grid <- cbind(grid[, 2], grid[, 1])
    ldply(1:nrow(grid), function(i){
        coords <- as.numeric(grid[i, ])
        ell <- get.ellipses(c(nms[coords]), mclustobj)
        names(ell) <- c('yval', 'xval', 'classification')
        data.frame(xvar = nms[coords[1]], yvar = nms[coords[2]], ell)
   })
}

u <- df.ellipses(data.mclust)
discscat2.png

Brandon Hurr

unread,
Feb 16, 2011, 5:40:04 AM2/16/11
to Dennis Murphy, ggplot2
Dennis, 

By last plot I meant the final plot from the following:
iris_mclust=Mclust(iris[,1:4])
plot(iris_mclust, data=iris[,1:4])

Just got your updated ellipse function, the graph looks great... how do we tell if it's actually faster? 

Brandon

P.S. I hope Jens shows up sometime soon. 
plot1.png
plot2.png
plot3.png
plot4.png

Brandon Hurr

unread,
Feb 16, 2011, 6:56:26 AM2/16/11
to Dennis Murphy, ggplot2
Dennis, 

I wrapped it in a function call and had the plots output as pngs. 

ggMclust(iris[1:4])

...and you get the two pictures. Beautiful. 

Still don't know what to do with the "uncertainty" graph, but you could use the extra spaces in the scatter matrix plot like you suggested. 

Brandon
scatter.matrix.png
BIC.plot.png
mclusthelp.r

Dennis Murphy

unread,
Feb 16, 2011, 8:41:53 AM2/16/11
to Brandon Hurr, ggplot2
Hi Brandon:

Re timings:

Full set of 16 panels (10 replicates)
   user  system elapsed
   1.17    0.00    1.17

Reduced set of 10 panels (10 replicates)
   user  system elapsed
   0.62    0.01    0.64

It appears to be almost twice as fast (independent of the time it takes to render the graph in a graphics device, obviously :) The time bottleneck is the translation to the graphics device - I would expect the simpler graph to render a bit more quickly, but I doubt it will take half the time.

As for the uncertainty, it appears the sizes in the original plot are related to the 0.8 and 0.95 quantiles of the uncertainty distribution:
> with(data.mclust, sum(uncertainty > quantile(uncertainty, 0.8)))
[1] 30
> with(data.mclust, sum(uncertainty > quantile(uncertainty, 0.95)))
[1] 8
where
> quantile(data.mclust$uncertainty, c(0.8, 0.95))
         80%          95%
1.049008e-10 1.533001e-07

The range is between 0 and 0.0002; the maximum is a distinct outlier, corresponding to row 42 of the iris data.

Alas, alpha is a strictly continuous aesthetic, as far as I can tell; moreover, it has no manual scale. Size, OTOH, can be mapped to levels of an ordered factor (as shown on the on-line help page for scale_size), and I managed to get something done in that respect. The problem is that the values with the 'large' uncertainties are exclusively in class 1, and since their values are clustered more tightly together than those in class 2, there is a danger of overplotting by use of size. For that reason, I constrained it to be between 1 and 2 in the code below.

I added a 'Quantile' (ordered) factor corresponding to the relative position of the uncertainty per the above quantile scheme, and added it to data frame all. I also changed the mapping variable by adding a size aesthetic and a manual scale to set the actual sizes of the points. FInally, I pulled the scales inside the graph and made some text size adjustments in opts().

mapping=aes(colour=as.factor(all$classification),
            shape=as.factor(all$classification),
            size = all$Quantile   )

mapping <- defaults(mapping, aes_string(x = "x", y = "y"))
class(mapping) <- "uneval"

df.ellipses <- function(mclustobj){

    require(ellipse)
    require(plyr)
    nms <- rownames(mclustobj$parameters$mean)
    n <- length(nms)
    grid <- expand.grid(x = 1:n, y = 1:n)
    grid <- subset(grid, x < y)
    grid <- cbind(grid[, 2], grid[, 1])
    ldply(1:nrow(grid), function(i){
        coords <- as.numeric(grid[i, ])
        ell <- get.ellipses(c(nms[coords]), mclustobj)
        names(ell) <- c('yval', 'xval', 'classification')
        data.frame(xvar = nms[coords[1]], yvar = nms[coords[2]], ell)
   })
}

grid <- expand.grid(x = 1:ncol(orig.data), y = 1:ncol(orig.data))
grid <- subset(grid, x < y)

all <- ldply(1:nrow(grid), function(i) {
  xcol <- grid[i, "x"]
  ycol <- grid[i, "y"]
  data.frame(xvar = names(orig.data)[ycol], yvar = names(orig.data)[xcol],
  x = orig.data[, xcol], y = orig.data[, ycol],
  classification = data.mclust$classification,
  Quantile = with(data.mclust, cut(uncertainty, breaks = c(-0.01,
                  quantile(uncertainty, 0.8), quantile(uncertainty, 0.95), 0.0003),
                  labels = c('< 0.8', '0.8-0.95', '> 0.95'))))
  })

all$Quantile <- factor(all$Quantile, levels = c('< 0.8', '0.8-0.95', '> 0.95'))

all$xvar <- factor(all$xvar, levels = names(orig.data))
all$yvar <- factor(all$yvar, levels = names(orig.data))

densities <- ldply(1:ncol(orig.data), function(i)
       data.frame(xvar = names(orig.data)[i], yvar = names(orig.data)[i], x = orig.data[, i]))

u <- df.ellipses(data.mclust)

g <- ggplot(all)
g + facet_grid(xvar ~ yvar, scales = "free") + theme_bw() +

    geom_point(map = mapping, na.rm = TRUE) +
    stat_density(data = densities, aes(x = x, y = ..scaled.. * diff(range(x)) + min(x),
                                       colour = as.factor(data.mclust$classification)),
                 position = "identity", geom = "line", size = 1) +
    geom_path(data = u, aes(x = xval, y = yval, group = classification,
                            linetype = classification), size = 1) +
    labs(x = '', y = '', colour = 'Class', shape = 'Class', size = 'Quantile') +
    scale_size_manual( values = c(1, 1.5, 2) ) +
    opts(legend.position = c(0.85, 0.75),
         legend.title = theme_text(size = 12, face = 'bold'),
         legend.text = theme_text(size = 11),
         axis.text.x = theme_text(size = 8),
         axis.text.y = theme_text(size = 8) )

I wanted to change the labels for the categories by using expressions, but no can do. At least I found a way to use the extra space to position the legend inside the graph :)

Dennis

Brandon Hurr

unread,
Feb 16, 2011, 9:15:00 AM2/16/11
to Dennis Murphy, ggplot2
Dennis, 

I think my brain has some cobwebs in it. I swear I looked for "uncertainty" in the mclust output and didn't see it. Sure enough, it was the bottom one... 

> str(data.mclust)
List of 11
 $ modelName     : chr "VEV"
 [SNIP]
 $ uncertainty   : num [1:150] 2.51e-11 5.56e-08 3.64e-09 8.61e-08 8.50e-12 ...
 - attr(*, "class")= chr "Mclust"

Looks like it's all sorted then. Wouldn't have even got close without your and Charlotte's help. 

Many thanks, 

Brandon
scatter.matrix.png
mclusthelp.r

Brandon Hurr

unread,
Feb 16, 2011, 9:17:10 AM2/16/11
to Dennis Murphy, ggplot2
That was the older version... 

B
mclusthelp.r

Jens Hegg

unread,
Feb 19, 2011, 7:52:02 PM2/19/11
to Brandon Hurr, Dennis Murphy, Charlotte Wickham, ggplot2
Everyone,

That was quick work. I really like the idea of the pairs plot with density estimations along the diagonal. I think I like it with the density plots separated by group. If they were together there is the potential of a bulbous blob of a distribution even if you have clear groups within the data, depending on the kernel adjustment being used. With them separated you can more easily interpret whether the distributions look separated or are close enough together to be from a single distribution.

Is there a way to draw the dotted lines within the ellipses? I don't see it in the ellipse() manual?

My guess, as far as the uncertainty plot goes, is that the point size is based on the $uncertainty value returned from Mclust(). There are never more than three sizes of dots so it must be binning them. I would assume it's binning based on three st. devs. from the mean of the encertainties but that is just a guess. The uncertainty numbers for the iris dataset are incredibly small, I'm not really sure what they represent. They are also fairly obviously not normally distributed, with all those zeros I would say I'd fit it with an exponential distribution but I'm not expert.  Sorry I'm not more help.


Jens

Brandon Hurr

unread,
Feb 20, 2011, 3:54:17 AM2/20/11
to Jens Hegg, Dennis Murphy, Charlotte Wickham, ggplot2
I'm guessing you'd have to do them with abline, but I think that would be quite complex as you'd have to compute the line equation from the mean center and the shortest and fatest point of the ellipse, the intersects of the lines with the ellipse. Perhaps there's an easier way, but I don't know of it. 

B

Jens Hegg

unread,
Feb 24, 2011, 12:59:34 AM2/24/11
to Brandon Hurr, Dennis Murphy, Charlotte Wickham, ggplot2
All,

I'm attempting to polish up these cluster plots and wanted a way to refer to each plot within the text of my paper. I plotted the cluster centers easily enough with geom_point(). What I would like to do is to plot the cluster number in black on top of the cluster center to make it visible within the noise. I've been trying without success to use geom_text() to plot these cluster centers using a text symbol so that each cluster will be numbered, 1, 2, 3...etc. 

Here is the code that works, without text symbols.

library(ellipse)
library(mclust)
library(ggplot2)

iris_mclust=Mclust(iris[,1:4])

get.ellipses <- function(coords, mclust.fit){
 centers <- mclust.fit$parameters$mean[coords, ]
 vars <- mclust.fit$parameters$variance$sigma[coords, coords, ]
 ldply(1:ncol(centers), function(cluster){
   data.frame(ellipse(vars[,,cluster], centre = centers[, cluster],
     level = 0.5), classification = cluster)
   })
}

iris.el <- get.ellipses(c("Sepal.Length", "Sepal.Width"), iris_mclust)
iris$classification <- iris_mclust$classification

#creates matrix with cluster centers and numbers
cluster_num=cbind(1:nrow(t(iris_mclust$parameters$mean)))
Centers=as.data.frame(cbind(cluster_num, t(iris_mclust$parameters$mean)))
colnames(Centers)=c("clustNum", "Sepal.Length", "Sepal.Width")

ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = factor(classification))) +
 geom_point(aes(shape = classification))+
 geom_path(data = iris.el,
   aes(group = classification, linetype = classification))+

#plots cluster centers
geom_point(data=Centers, aes(x=Sepal.Length, y=Sepal.Width), shape=16, size=3, colour="gray58")



What I tried was adding geom_text() like this:

geom_text(data=Centers, aes(x=Sepal.Length, y=Sepal.Width, label=clustNum))

I get an error - "Error in factor(classification) : object 'classification' not found" - except I don't understand why it needs that factor.


Any help would be appreciated. It seems like this must be simple but I'm not seeing it.

Thanks,

Jens

Dennis Murphy

unread,
Feb 24, 2011, 1:54:34 AM2/24/11
to Jens Hegg, ggplot2
Hi:

In place of geom_point() to mark the cluster centers, use

geom_text(data=Centers, aes(x=Sepal.Length, y=Sepal.Width, label = clustNum), size=10, colour="gray58")

Worked for me. If you're going to do that, the legend, color and shape become rather pointless, don't you think?

HTH,
Dennis

Jens Hegg

unread,
Feb 24, 2011, 2:34:25 AM2/24/11
to Dennis Murphy, ggplot2
Thanks for the help. Apparently I was under the impression that you didn't need to specify the size and colour if you didn't want to. 

While using the large number works I was trying to go for a compound shape, something that would both indicate the exact center and also label the cluster. Using your help this is what I did. Works well and I like the look.

#plots cluster centers with a filled circle and a number within to identify the cluster.
geom_point(data=as.data.frame(Centers), aes(x=Sepal.Length, y=Sepal.Width), shape=16, size=5, colour="gray58") +
geom_text(data=Centers, aes(x=Sepal.Length, y=Sepal.Width, label = clustNum), size=3, colour="black")


Thanks

Jens

Jens Hegg

unread,
Feb 24, 2011, 2:56:05 AM2/24/11
to Dennis Murphy, ggplot2
Also, yes I agree that after labeling them the legend isn't necessary, nor is color or shape depending on what you are doing. In the plots I'm working on using my research data I've pulled the legend out with a legend=NULL statement in the ggplot() element. The ellipses don't capture all the points though, so having either color or shape is still important to understand where the edges of the cluster are. Using both would be overkill. I'm working in B&W for publication so I cut out color, but given my druthers I think color is easier to quickly interpret.

Jens


On Feb 23, 2011, at 10:54 PM, Dennis Murphy wrote:

Reply all
Reply to author
Forward
0 new messages