Violin plots in ggplot?

Stavros Macrakis

Jun 8, 2009, 4:05:52 PM

to ggplot2

Has anyone worked on violin plots for ggplot?

(I actually prefer half-violins, but that's a detail...)

-s

learnr

Jun 8, 2009, 5:10:39 PM

to ggplot2

p <- ggplot(mtcars, aes(mpg))
p + geom_ribbon(aes(ymax = ..density.., ymin = - ..density..), stat =
"density") + facet_grid(cyl ~ .)

--
http://learnr.wordpress.com/

JiHO

Jun 8, 2009, 5:58:51 PM

to ggplot2

On 2009-June-08 , at 17:10 , learnr wrote:

> p <- ggplot(mtcars, aes(mpg))
> p + geom_ribbon(aes(ymax = ..density.., ymin = - ..density..), stat =
> "density") + facet_grid(cyl ~ .)
>

Argh, that's so cool!

Instead of being clever, as above, I created a clunky geom-like
function a while back. It used to work very well but now has some
display issues on some of my data. The only advantage relative to the
solution by learnr is that you get one x axis instead of facets which
probably looks cleaner when there are many levels. The main
disadvantages are less flexibility/integration within ggplot and
different scales for the density estimates: all violins are the same
width (you'll see what I mean when comparing with the plot by learnr).

Maybe someone can improve on that and provide the best of both worlds.

library(ggplot2)
library(reshape) # melt() and cast() come from the reshape package

geom_violin <- function(data, mapping, bw="nrd0", adjust=1,
kernel="gaussian", ...)
#
# geom-like function to draw violin plots with ggplot2. Analogous to boxplots
# default aesthetics:
# x grouping factor, on the x axis
# y variable
# arguments passed to density
# bw
# adjust
# kernel
# ... passed to density and to geom_polygon
#
{
x = deparse(mapping$x)
y = deparse(mapping$y)
molten = melt(data, measure.var=y)
nbByX = cast(molten, formula=paste(x, "~ variable"),
fun.aggregate=length)

# remove levels for which there are fewer than 2 points because the
# density estimate fails
molten = molten[molten[[x]]%in%(nbByX[nbByX[y]>=2,x]),]
molten[[x]] = factor(molten[[x]]) # remove potentially absent factor levels?

# densities = cast(molten, formula=paste(x, "~ variable"), fun.aggregate=density)
# why doesn't this work?
# do it 'by hand'

moltenL = split(molten$value, molten[[x]])

densities = lapply(moltenL, function(x, bw, adjust, kernel, ...){
# compute density
out = density(x, bw=bw, adjust=adjust, kernel=kernel, ...)
out = data.frame(y=out$x, dens=out$y)
# scale density
out$dens = out$dens/max(out$dens) * 0.45 # max is 0.45
# duplicate it to make a nice polygon
out2 = data.frame(y=rev(out$y), dens=-rev(out$dens))
out = rbind(out,out2,NA)
}, bw=bw, adjust=adjust, kernel=kernel, ...)

# x scale has step of 1
for (i in 1:length(densities)) {
densities[[i]]$dens = i + densities[[i]]$dens
}
xLabels = names(densities)
densities = do.call("rbind",densities)

# build geom and set correct scale
g = geom_polygon(data=densities, mapping=aes(x=dens, y=y), ...)
s = scale_x_continuous(breaks=1:length(xLabels), labels=xLabels)

return(list(g,s))
}

ggplot() + geom_violin(mtcars, aes(x=cyl, y=mpg))
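
To make the scaling difference concrete, here is a small sketch in base R only (no ggplot2 needed). It compares the per-violin scaling used in the function above, where each density is divided by its own maximum, with a common scaling, where every density is divided by the largest peak across groups. The variable names (groups, dens, peaks, per_violin, common) are just illustrative, and the 0.45 half-width is taken from the function above.

```r
# Kernel density of mpg for each level of cyl
groups <- split(mtcars$mpg, mtcars$cyl)
dens <- lapply(groups, density)

# Peak density of each group
peaks <- sapply(dens, function(d) max(d$y))

# (a) per-violin scaling: every violin reaches a half-width of 0.45
per_violin <- lapply(dens, function(d) d$y / max(d$y) * 0.45)

# (b) common scaling: only the group with the highest peak reaches 0.45,
#     so relative widths stay comparable across groups
common <- lapply(dens, function(d) d$y / max(peaks) * 0.45)

# Maximum half-width per group under each scaling
print(sapply(per_violin, max))
print(sapply(common, max))
```

With (a) all violins are equally wide, which is what the function above produces; with (b) the widths reflect the actual density values, closer to what the faceted ribbon plot shows.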

JiHO
---
http://jo.irisson.free.fr/

Stavros Macrakis

Jun 8, 2009, 6:06:41 PM

to ggplot2, learnr

(learnr, sorry for the repeated mail...)

That works nicely, but I was thinking in terms of a violin plot presented similarly to a boxplot:

       ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

Perhaps even with (quantized) continuous data. For example, suppose I have a scatterplot of arrival time (y) vs. speed of cars (x) at a certain place on a road. I might want to show a vertical violin plot of the distribution of speed for each hour of the day under the scatterplot, something like this boxplot:

col <- hsv(1,.5,.7)

ggplot(mtcars) +
geom_boxplot(data=data.frame(cutdisp=as.integer(cut(mtcars$disp,breaks=0:5*100))*100-50,
                               mpg=mtcars$mpg),
               aes(x=cutdisp, y=mpg, group = cutdisp ),
               color=col,
               outlier.colour=alpha(0,1),    # Outliers show themselves so don't show them in boxplot
               fill=alpha(col,.3) ) +
geom_point(data=mtcars,aes(disp,mpg))

Note the messy handling of cuts to align them with the data values. I suppose it would be easy enough to write a centered_cuts function....
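
A minimal sketch of such a helper, assuming fixed-width bins that are closed on the right (the default for cut()). The name centered_cuts comes from the note above; the arguments width and origin are hypothetical.

```r
# Hypothetical helper: assign each value to a fixed-width bin
# (origin + (k-1)*width, origin + k*width] and return the bin's midpoint.
centered_cuts <- function(x, width, origin = 0) {
  bin <- ceiling((x - origin) / width)  # 1-based bin index, right-closed
  origin + (bin - 0.5) * width          # midpoint of that bin
}

# Same bin centers as the as.integer(cut(...))*100-50 expression above
cutdisp <- centered_cuts(mtcars$disp, width = 100)
print(table(cutdisp))
```

Unlike cut(), this sketch does not return NA for values outside a fixed set of breaks; it simply keeps extending the bins in both directions.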

Though a quantile regression is one way to show distributions in cases like this, it doesn't necessarily do a good job of showing bimodality and other structure.

              -s

Stavros Macrakis

Jun 8, 2009, 6:11:16 PM

to JiHO, ggplot2

This looks great! I will play with it.

Thanks,

-s
