A non-random geom_jitter


BenJWoodcroft

Nov 12, 2011, 10:52:53 PM
to ggplot2
Hi,

Thanks in advance. I'm looking for a way to plot points that avoids
over-plotting, but doesn't have the messiness of jittering. Something
like http://arthritis-research.com/content/figures/ar1759-2-l.jpg

I have a discrete y-axis, comparing 2 classes on the x axis. As in the
linked image, I was thinking of moving the points horizontally, but
not vertically, in a deterministic way so that they are guaranteed not
to overlap. The horizontal distance between points that would overlap
in a regular geom_point would be the same everywhere on the plot. I
imagine the method would probably be only appropriate for datasets
with limited points, but I seem to often work with these.

Some example data:
> dput(dg)
structure(list(xvar = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("class1", "class2"
), class = "factor"), yvar = c(16, 13, 18, 18, 18, 18, 19, 18,
16, 15, 23, 24, 18, 20, 20)), .Names = c("xvar", "yvar"), row.names =
c(NA,
-15L), class = "data.frame")

Currently I would use something like below, but I feel this is
suboptimal because when I look at those plots I tend to automatically
try to interpret the jitter, when it plainly is just randomness
> qplot(xvar,yvar,data=dg,geom='jitter',position=position_jitter(width=0.3, height=0))

Thanks!
ben

Joshua Wiley

Nov 13, 2011, 12:10:58 AM
to BenJWoodcroft, ggplot2
Hi Ben,

I do not believe there is currently a way to do this. I also have
some interest in this and one of my side projects has been working on
this, but it has been slow. There are basically two steps:

1) determine which points are overlapping
2) calculate new values in whichever dimension you are using (in the
picture you referenced, the x direction) to avoid any overlap

Depending what assumptions you can make, 1) becomes more or less
difficult. For example, if you have roughly discrete values that will
not overlap vertically (unless they are exactly the same and directly
on top of each other), then you only need to displace the points when
they have equal values. However, if they could overlap, then you need
to take into account the size of your plotting symbol when determining
if they overlap and the amount of displacement required to alleviate
this.

You can check out the beeswarm package:
http://www.cbs.dtu.dk/~eklund/beeswarm/ it does not work directly
with ggplot2, but you may be able to do some manual work with the
swarmx() or swarmy() functions to get a new set of coordinates for
your points you can directly plot.

Assuming you have a recent version of R with internet access:

source("http://joshuawiley.com/R/Jmisc_devel_installer.R")
require(Jmisc)
Jmisc:::StackAlgorithm

will show you how I attempted to implement an algorithm by Wilkinson
to avoid overplotting. It seems (to me) somewhat more precise than
what is done in the beeswarm package. Essentially, for n points, it
creates a logical n x n matrix of all neighbors. It first picks the
point with the most neighbors, makes that the first 'stack', adjusts
the coordinates of all those k points so they do not overlap, removes
those points, and proceeds likewise on the (n - k) x (n - k) matrix until
there are no more points. It sort of works but has quite a few
troubles, and my attempts to get even the troubled version working
with grid, so it could eventually become a grob used in ggplot2, could
fill a novel of short failure stories. The only recent progress (from
the last time I discussed this) is that I know a lot more C++ than I
did before and sometime next year I should have the algorithm itself
(which works just fine), implemented in C++ using the Rcpp package by
Dirk and Romain. This should at least take care of the algorithm's
performance issues. Now if only my poor brain could be rewritten in
C++....
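
The neighbor-matrix idea described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the code in Jmisc:::StackAlgorithm: the function name stack_points and its diameter argument are hypothetical, overlap is judged along y only, and each stack is simply spread symmetrically around zero.

```r
# Hypothetical sketch of a greedy "most neighbours first" stacking pass.
# Assumption: two points overlap when their y values differ by less than
# one point diameter (expressed in data units).
stack_points <- function(y, diameter) {
  n <- length(y)
  # Logical n x n matrix; each point also counts as its own neighbour
  neighbours <- abs(outer(y, y, "-")) < diameter
  offset    <- numeric(n)
  stack_id  <- integer(n)
  remaining <- rep(TRUE, n)
  id <- 0L
  while (any(remaining)) {
    # Neighbour counts among the points that are not yet placed
    counts <- colSums(neighbours[remaining, , drop = FALSE])
    counts[!remaining] <- -1
    seed    <- which.max(counts)
    members <- which(neighbours[seed, ] & remaining)
    id <- id + 1L
    stack_id[members] <- id
    # Spread the k members one diameter apart, centred on zero
    k <- length(members)
    offset[members] <- (seq_len(k) - (k + 1) / 2) * diameter
    remaining[members] <- FALSE
  }
  data.frame(y = y, stack = stack_id, offset = offset)
}

# The class1 values from the example data at the top of the thread
res <- stack_points(c(16, 13, 18, 18, 18, 18, 19, 18, 16, 15), diameter = 0.5)
res
```

With continuous data the members of a stack are only guaranteed to neighbor the seed point, not each other, so a real implementation would need more care than this.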

Cheers,

Josh

> --
> You received this message because you are subscribed to the ggplot2 mailing list.
> Please provide a reproducible example: http://gist.github.com/270442
>
> To post: email ggp...@googlegroups.com
> To unsubscribe: email ggplot2+u...@googlegroups.com
> More options: http://groups.google.com/group/ggplot2
>

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

Winston Chang

Nov 13, 2011, 12:14:47 AM
to BenJWoodcroft, ggplot2
Hi Ben -

It happens that I wrote some code just yesterday to make this sort of dot plot. I had written it only for a single group, not for the multiple groups you have on the x axis, but I've made some modifications to make it work. The best-looking option is a little bit of a hack, though.

First, load the data and the plyr package:

dg <- structure(list(xvar = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("class1", "class2"
), class = "factor"), yvar = c(16, 13, 18, 18, 18, 18, 19, 18,
16, 15, 23, 24, 18, 20, 20)), .Names = c("xvar", "yvar"), row.names =
c(NA,
-15L), class = "data.frame")

library(plyr)

Next, we'll bin the data at each xvar and yvar, and assign counts of 1, 2, 3, etc, within each bin:

# Get a count for each bin
dg <- ddply(dg, .(xvar, yvar), transform, bincount=1:length(xvar))
#   xvar yvar bincount
# class1   13        1
# class1   15        1
# class1   16        1
# class1   16        2
# class1   18        1
# class1   18        2
# class1   18        3
# class1   18        4
# class1   18        5
# class1   19        1
# class2   18        1
# class2   20        1
# class2   20        2
# class2   23        1
# class2   24        1
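
For readers without plyr, the same within-bin counter can probably be built with base R's ave(). This is a sketch of an equivalent, not the code used in this post; the data frame is the example data retyped in compact form.

```r
# Example data from the thread, written out compactly
dg <- data.frame(
  xvar = factor(rep(c("class1", "class2"), times = c(10, 5))),
  yvar = c(16, 13, 18, 18, 18, 18, 19, 18, 16, 15, 23, 24, 18, 20, 20)
)

# Running index 1, 2, 3, ... within each (xvar, yvar) combination
dg$bincount <- ave(seq_along(dg$yvar), dg$xvar, dg$yvar, FUN = seq_along)

dg[order(dg$xvar, dg$yvar), ]
```

Unlike ddply(), ave() keeps the rows in their original order, so the last line sorts them only for display.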

Now plot it, using bincount as the x position, and facet by xvar. I've also manually set the limits to get the dots closer to each other:
ggplot(dg, aes(x=bincount, y=yvar)) + geom_point() + 
    facet_grid(. ~ xvar) +
    ylim(0,25) +
    scale_x_continuous(limits=c(-5,10), breaks=NA) +
    opts(axis.title.x=theme_blank())

The first one doesn't look good because the points aren't centered. This version centers the dots, although it's slightly different than the version you linked to. In that version, it looks like the centering rounded to the nearest whole dot, so that odd and even-numbered counts weren't aligned in the middle.

# Center the bin around 0
dg <- ddply(dg, .(xvar, yvar), transform, midbincount=bincount-.5-length(bincount)/2)

ggplot(dg, aes(x=midbincount, y=yvar)) + geom_point() + 
    facet_grid(. ~ xvar) +
    ylim(0,25) +
    scale_x_continuous(limits=c(-10,10), breaks=NA) +
    opts(axis.title.x=theme_blank())
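
The plotting calls in this post use the ggplot2 API as it was in 2011. In current ggplot2, opts() has been replaced by theme(), theme_blank() by element_blank(), and breaks=NA by breaks=NULL. A sketch of the centered, faceted plot in that newer style follows; the base R ave() calls stand in for ddply() and are my substitution, not part of the original code.

```r
dg <- data.frame(
  xvar = factor(rep(c("class1", "class2"), times = c(10, 5))),
  yvar = c(16, 13, 18, 18, 18, 18, 19, 18, 16, 15, 23, 24, 18, 20, 20)
)

# Index within each (xvar, yvar) cell, then centre it around 0
dg$bincount    <- ave(seq_along(dg$yvar), dg$xvar, dg$yvar, FUN = seq_along)
cellsize       <- ave(dg$yvar, dg$xvar, dg$yvar, FUN = length)
dg$midbincount <- dg$bincount - 0.5 - cellsize / 2

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dg, aes(x = midbincount, y = yvar)) +
    geom_point() +
    facet_grid(. ~ xvar) +
    ylim(0, 25) +
    scale_x_continuous(limits = c(-10, 10), breaks = NULL) +  # was breaks=NA
    theme(axis.title.x = element_blank())                     # was opts(...)
  print(p)
}
```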



I didn't like the facets, so I decided to try it one more way. I wanted to keep xvar on the x axis, but use midbincount to adjust the position of the dots. To do this, I converted the xvar factor to a numeric vector, xvarn, and then added midbincount/12. The 12 was just a number I chose by trial and error, to make it look good. Also, instead of numeric tick labels on the x-axis, I used the factor names.

# Hack: convert the xvar factor to a numeric, xvarn
dg$xvarn <- as.integer(dg$xvar)

# Then we'll add xvarn and midbincount divided by some constant
# You'll have to use your eye to adjust the constant
ggplot(dg, aes(x= xvarn+midbincount/12, y=yvar)) + 
    geom_point() +
    ylim(0,25) +
    scale_x_continuous(breaks=1:nlevels(dg$xvar),
                       labels=levels(dg$xvar),
                       limits=c(0,3))      # Expand x range so it looks nice




Wilkinson has written about dot plots, but his version is slightly different. It allows for truly continuous variables, and can have variable dot spacing. The code I have here won't work on truly continuous data; it'll only work properly if the data has discrete values.
http://www.cs.uic.edu/~wilkinson/Publications/dots.pdf

If anyone's interested, I've also written a function to set up a Wilkinson-style dot plot as well, but I haven't yet made it work with multiple groups.

Note to developers: it would be great if there were a geom or stat that made these dot plots -- I wish I could write it myself, but I don't know enough about the inner workings of ggplot2 to do it. If someone wants to do this, I can send the code I have.

I hope this is helpful!
--Winston



Attachments: dotplot-1.png, dotplot-2.png, dotplot-3.png
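
For anyone reading this thread later: ggplot2 has since gained geom_dotplot(), which covers the request for a dedicated geom. The sketch below shows how the example data might be drawn with it. The binwidth of 0.5 is a guess chosen for this data, not a recommended value.

```r
dg <- data.frame(
  xvar = factor(rep(c("class1", "class2"), times = c(10, 5))),
  yvar = c(16, 13, 18, 18, 18, 18, 19, 18, 16, 15, 23, 24, 18, 20, 20)
)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Bin along y, stack the dots horizontally and centre each stack
  p <- ggplot(dg, aes(x = xvar, y = yvar)) +
    geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.5) +
    ylim(0, 25)
  print(p)
}
```

Here the dot diameter follows from binwidth, so the two have to be tuned together.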

Ben Woodcroft

Nov 13, 2011, 1:50:16 AM
to Winston Chang, ggplot2
Hi,

Wow, fantastic answers.

On 13 November 2011 16:14, Winston Chang <winsto...@gmail.com> wrote:
> Note to developers: it would be great if there were a geom or stat that made these dot plots -- I wish I could write it myself, but I don't know enough about the inner workings of ggplot2 to do it. If someone wants to do this, I can send the code I have.
>
> I hope this is helpful!
> --Winston

Indeed it was. Problem solved. I went with the second one - I'd actually thought that my linked image would look better if the dots were centred. Thankfully, my data is discrete, so your solution was able to entirely solve my problem.

I'll second the notion that it would be great if there were a geom/stat for this. I know this kind of plot from other graphing software such as Prism, but it's mailing lists like this that make me stick with free software (among other things, of course).

Thanks,
ben

Brandon Hurr

Nov 13, 2011, 3:17:23 AM
to Ben Woodcroft, Winston Chang, ggplot2
I'd just like to say that adding this as a standard feature would be cool. Have to work it out a bit better for continuous data, but for discrete data it looks amazing. O_O

Brandon



Jean-Louis

Nov 13, 2011, 3:43:27 AM
to ggplot2
Have a look at the beeswarm package, http://www.cbs.dtu.dk/~eklund/beeswarm/

Plotting is done using base R functions but it could be adapted to use
ggplot instead.

However that would require quite a lot of changes in the function
given the way it works.

As I am unable to do that I have just modified the function to return
the coordinates and then use ggplot

Here is a (non-reproducible) example:

beeswarm(conc ~ nvisit,
         data = pkm, pch = 16, spacing = 1,
         col = 1:4)
# beeswarm function modified to return out.bee. Quick and dirty...
p <- ggplot(out.bee, aes(x, y,
                         shape = as.factor(x.orig),
                         colour = as.factor(x.orig)))
pgconc_bee <- p + geom_point(size = 2) +
    scale_shape(legend = FALSE) +
    scale_colour_discrete(legend = FALSE) +
    scale_x_continuous(breaks = 1:length(unique(out.bee$x.orig)),
                       labels = as.character(unique(out.bee$x.orig)),
                       legend = FALSE) +
    scale_y_continuous(breaks = seq(0, 21000, 1000)) +
    xlab("Month") + ylab("Ctrough (ng/mL)") +
    opts(title = "Ctrough")
print(pgconc_bee)

It is such a nice way to present the data, and the medical community
is so fond of such graphs, that it would indeed be nice if
"someone" would tweak it to use ggplot natively.

I know the author of the package would be receptive to ggplot/lattice
contributions.

JL

On 13 nov, 09:17, Brandon Hurr <brandon.h...@gmail.com> wrote:
> I'd just like to say that adding this as a standard feature would be cool.
> Have to work it out a bit better for continuous data, but for discrete data
> it looks amazing. O_O
>
> Brandon
>
> On Sun, Nov 13, 2011 at 06:50, Ben Woodcroft <wood...@gmail.com> wrote:
> > Hi,
>
> > Wow, fantastic answers.
>

Winston Chang

Nov 13, 2011, 2:19:20 PM
to Brandon Hurr, Ben Woodcroft, ggplot2
All this talk inspired me to improve my code for a Wilkinson-style dot plot, which can be used with continuous data. With this style of plot, the data points are binned, and the bins can have variable spacing. The function wbindots() will do the dot binning and stacking, though you'll still have to do some manual adjustments to make the output look nice.

To use this, set "maxbinwidth" to the maximum bin size. If you set "maxbinwidth" to zero, then it works like the code I posted earlier, where the data has to be discrete/granular.

The way the dots are stacked can be set with "stacking". The possible options are "center", "centerwhole", and "stack". I have some examples below that show the results.


# ============================
# Wilkinson dotplot
wbindots <- function(data, binvar, maxbinwidth=0, stacking="center") {
    require(plyr)
    
    # Sort by the binning var
    data <- data[order(data[, binvar]), ]

    cbinnum      <- 0    # Current bin number
    data$binnum  <- NA
    binend       <- -Inf # End position of current bin
    
    for (i in 1:nrow(data)) {
        # Get current value
        cval <- data[i, binvar]
        
        # If at or past end of bin, start a new bin at this point
        # (the bins are centered later, after the loop)
        if (cval >= binend) {
            binend <- cval + maxbinwidth
            cbinnum <- cbinnum +1
        }
        
        data$binnum[i] <- cbinnum
    }
    
    # Center the bins along binvar
    data <- ddply(data, .(binnum), .fun = 
        function(xx, col) {
            xx$bincenter <- mean(range(xx[,col]))
            return(xx)
        }, col=binvar)


    # Set the stacking position
    data <- ddply(data, .(binnum), transform, stackpos=1:length(binnum))

    if(stacking=="center") {
        data <- ddply(data, .(binnum), transform,
                      stackpos = stackpos - length(stackpos)/2 - .5)
    } else
    if (stacking=="centerwhole") {
        data <- ddply(data, .(binnum), transform, 
                      stackpos = stackpos - round(length(stackpos)/2-.5) - 1)
    } else
    if (stacking=="stack") {
        #Do nothing 
    }
    
    # Remove binnum column
    data$binnum <- NULL

    return(data)
}
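
To make the binning step easier to follow, here is a stripped-down base R sketch of the same left-to-right pass, followed by the bin centering and the "center" stacking. The helper name greedy_bins is hypothetical, and the input values are rounded versions of the first few yvar values in the example output further down.

```r
# Hypothetical helper: walk through sorted values and start a new bin
# whenever a value lies at or beyond the end of the current bin
greedy_bins <- function(x, maxbinwidth) {
  stopifnot(!is.unsorted(x))
  bin     <- integer(length(x))
  current <- 0L
  binend  <- -Inf
  for (i in seq_along(x)) {
    if (x[i] >= binend) {
      binend  <- x[i] + maxbinwidth
      current <- current + 1L
    }
    bin[i] <- current
  }
  bin
}

y   <- c(3.25, 3.80, 4.94, 5.45, 5.51, 5.73, 5.90)
bin <- greedy_bins(y, maxbinwidth = 0.5)

# Bin centre is the midpoint of the bin's observed range;
# stack positions are centred around zero
centre   <- ave(y, bin, FUN = function(v) mean(range(v)))
stackpos <- ave(y, bin, FUN = function(v) seq_along(v) - length(v) / 2 - 0.5)

data.frame(y, bin, centre, stackpos)
```

One thing I noticed while writing this out, which is my reading of the code and not something confirmed in the thread: because the comparison is `>=`, a maxbinwidth of 0 appears to put every point in its own bin, exact duplicates included. A strict `>` would keep duplicates together.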


Here are a bunch of examples:

# Make sample data
set.seed(123)
dg <- data.frame(group=sample(c("A","B"), 40, replace=TRUE),
                 yvar=rnorm(40, mean=10, sd=4))
# group     yvar
#     A 5.728705
#     B 9.128100
#     A 5.895982
#     ... [40 lines total]


# ======= Single group 
dg <- wbindots(dg, binvar="yvar", maxbinwidth=.5, stacking="center")
# group     yvar bincenter stackpos
#     A 3.253227  3.253227      0.0
#     B 3.804989  3.804989      0.0
#     B 4.938415  4.938415      0.0
#     B 5.447452  5.671717     -1.5
#     B 5.507566  5.671717     -0.5
#     A 5.728705  5.671717      0.5
#   ...

ggplot(dg, aes(x=stackpos, y=bincenter)) + geom_point() + 
    ylim(0,25) +
    scale_x_continuous(limits=c(-15,15), breaks=NA) +
    opts(axis.title.x=theme_blank())



# ===========================
# With multiple groups
# ===========================

# Use ddply to split by column "group", and then run wbindots()
dg <- ddply(dg, .(group), wbindots,
            binvar="yvar", maxbinwidth=.5, stacking="center")

# ======= Faceted graph
fp <- 
ggplot(dg, aes(x=stackpos, y=bincenter)) + geom_point() + 
    facet_grid(. ~ group) +
    ylim(0,25) +
    scale_x_continuous(limits=c(-6,6), breaks=NA) +
    opts(axis.title.x=theme_blank())
fp


# ======== Same, but with stacking="centerwhole"
dgw <- ddply(dg, .(group), wbindots,
            binvar="yvar", maxbinwidth=.5, stacking="centerwhole")

# Make faceted plot, with new data frame
fp %+% dgw



# ======= Same, but stacking="stack"
dgs <- ddply(dg, .(group), wbindots,
            binvar="yvar", maxbinwidth=.5, stacking="stack")

# Make faceted plot, with new data frame
fp %+% dgs



# ======= Non-faceted with the hack described in previous post

# Convert the group factor to a numeric, groupn
dg$groupn <- as.integer(dg$group)

# Then we'll add groupn and stackpos divided by some constant
# You'll have to use your eye to adjust the constant
ggplot(dg, aes(x=groupn+stackpos/15, y=bincenter)) + 
    geom_point() +
    ylim(0,25) +
    scale_x_continuous(breaks=1:nlevels(dg$group),
                       labels=levels(dg$group),
                       limits=c(0.5,2.5))
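
A note for newer R versions: since R 4.0.0, data.frame() no longer converts strings to factors by default, so as.integer(dg$group) would return NA and levels(dg$group) would return NULL. A small sketch of the explicit conversion that the numeric-offset hack appears to need, using a three-row toy data frame:

```r
# Toy data; stringsAsFactors = FALSE mimics the default in R >= 4.0.0
dg <- data.frame(group = c("A", "B", "A"), stringsAsFactors = FALSE)

# Convert explicitly so that integer codes and level names are available
dg$group  <- factor(dg$group)
dg$groupn <- as.integer(dg$group)

levels(dg$group)
dg$groupn
```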


Attachments: dotplot-1.png, dotplot-2.png, dotplot-3.png, dotplot-4.png, dotplot-5.png

Hadley Wickham

Nov 13, 2011, 3:44:00 PM
to Winston Chang, Brandon Hurr, Ben Woodcroft, ggplot2
I'd love to see a version of this code that used a grid grob to base the binning on the size of the points.

Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

baptiste auguie

Nov 13, 2011, 4:01:42 PM
to Hadley Wickham, ggplot2
Hi,

On Mon, Nov 14, 2011 at 9:44 AM, Hadley Wickham <had...@rice.edu> wrote:
> I'd love to see a version of this code that used a grid grob to base the
> binning on the size of the points.
>

Not sure if that can help here but in a previous discussion with JiHo
I proposed this dummy grob to stack points dynamically in the
available space,

library(grid)

pointlessGrob <- function(n=3){
grob(n=n, cl="pointless")
}

drawDetails.pointless <- function(x, recording = FALSE){

y.space <- convertY(unit(1,"npc"), "in", TRUE)
y.quantum <- y.space/x$n
grid.points(x=unit(rep(0.5, x$n), "npc"), y=unit(seq(y.quantum/2,
length=x$n, by=y.quantum), "in"),
size = unit(y.quantum*4/3, "in"), pch=18)

}

grid.pointless <- function(...)
grid.draw(pointlessGrob(...))

grid.pointless()

This is adjusting the point size to fit in a given space; perhaps
that's a good option here.

The reverse problem -- deciding how many points to have for a given
point size and space -- should be quite similar (possibly moving
calculations in a preDrawDetails method).
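
The reverse calculation is plain arithmetic once the available space is known. The sketch below leaves grid out entirely and assumes both lengths are already in inches (in a grob they would come from something like the convertY() call above). The function name points_that_fit is hypothetical.

```r
# Hypothetical helper: how many points of a given diameter fit into a
# given space, and where their centres would go (all lengths in inches)
points_that_fit <- function(space_in, diameter_in) {
  n <- floor(space_in / diameter_in)
  centres <- if (n > 0) {
    seq(diameter_in / 2, by = diameter_in, length.out = n)
  } else {
    numeric(0)
  }
  list(n = n, centres = centres)
}

fit <- points_that_fit(space_in = 3, diameter_in = 0.4)
fit$n
fit$centres
```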

Cheers,

baptiste

Winston Chang

Nov 13, 2011, 5:19:10 PM
to Hadley Wickham, Brandon Hurr, Ben Woodcroft, ggplot2
On Sun, Nov 13, 2011 at 2:44 PM, Hadley Wickham <had...@rice.edu> wrote:
> I'd love to see a version of this code that used a grid grob to base the binning on the size of the points.
>
> Hadley

I think that one drawback to doing it this way is that you're not controlling the bin size directly, so you'd have a hard time ending up with a nice round number for the bin size. Also, if you have a stat or geom do the binning for you, you won't have easy access to find out how big the bins are (I think), which is probably very important for a graph like this.

It would be nice to be able to say, "I want my bins and points to be exactly 1.5 units wide. Size my points to the appropriate diameter."

In other words, this is the reverse: instead of the bins being based on the size of points, set the size of points based on the bins. I think this is like what Baptiste suggested in his code. I don't yet understand grid graphics well enough to do this myself, but based on Baptiste's code, it doesn't look too complicated.
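
The conversion between a bin width in data units and a point diameter on the page can be sketched as simple proportions. This assumes a linear axis whose data range and physical length are both known; the function names and the 5-inch axis are made up for illustration.

```r
# Hypothetical helpers for a linear axis spanning axis_range over axis_inches
units_to_inches <- function(width_units, axis_range, axis_inches) {
  width_units / diff(axis_range) * axis_inches
}
inches_to_units <- function(width_inches, axis_range, axis_inches) {
  width_inches / axis_inches * diff(axis_range)
}

# "Bins exactly 1.5 units wide" on a 0-25 axis drawn 5 inches long
diam_in <- units_to_inches(1.5, axis_range = c(0, 25), axis_inches = 5)
diam_in

# And back again: the bin width implied by that point diameter
inches_to_units(diam_in, axis_range = c(0, 25), axis_inches = 5)
```

The first function matches the "size my points from the bins" direction, the second the "size my bins from the points" direction discussed in this thread.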

-Winston

Joshua Wiley

Nov 13, 2011, 5:58:49 PM
to Winston Chang, ggplot2
On Sun, Nov 13, 2011 at 2:19 PM, Winston Chang <winsto...@gmail.com> wrote:
> On Sun, Nov 13, 2011 at 2:44 PM, Hadley Wickham <had...@rice.edu> wrote:
>>
>> I'd love to see a version of this code that used a grid grob to base the
>> binning on the size of the points.
>>
>> Hadley
>
> I think that one drawback to doing it this way is that you're not
> controlling the bin size directly, so you'd have a hard time ending up with
> a nice round number for the bin size. Also, if you have a stat or geom do
> the binning for you, you won't have easy access to find out how big the bins
> are (I think), which is probably very important for a graph like this.
> It would be nice to be able to say, "I want my bins and points to be exactly
> 1.5 units wide. Size my points to the appropriate diameter."

Both may be nice, but I think the bin size being determined by the
point size is more important. If you set fixed bins, all you have
done is duplicate a histogram. The whole point of Wilkinson's paper
was to provide something closer to the raw data that avoids (to the
extent possible) arbitrary bin sizes and shifting of points from their
true values. Based on the point size, bins are sized to be the bare
minimum. Also, bins are not applied to the range of the data---the
data are iteratively binned to avoid a left to right or right to left
bias.

>
> In other words, this is the reverse: instead of the bins being based on the
> size of points, set the size of points based on the bins. I think this is
> like what Baptiste suggested in his code. I don't yet understand grid
> graphics well enough to do this myself, but based on Baptiste's code, it
> doesn't look too complicated.

>
> -Winston
>



Winston Chang

Nov 13, 2011, 6:05:41 PM
to Joshua Wiley, ggplot2
On Sun, Nov 13, 2011 at 4:58 PM, Joshua Wiley <jwiley...@gmail.com> wrote:
> On Sun, Nov 13, 2011 at 2:19 PM, Winston Chang <winsto...@gmail.com> wrote:
>> On Sun, Nov 13, 2011 at 2:44 PM, Hadley Wickham <had...@rice.edu> wrote:
>>>
>>> I'd love to see a version of this code that used a grid grob to base the
>>> binning on the size of the points.
>>>
>>> Hadley
>>
>> I think that one drawback to doing it this way is that you're not
>> controlling the bin size directly, so you'd have a hard time ending up with
>> a nice round number for the bin size. Also, if you have a stat or geom do
>> the binning for you, you won't have easy access to find out how big the bins
>> are (I think), which is probably very important for a graph like this.
>> It would be nice to be able to say, "I want my bins and points to be exactly
>> 1.5 units wide. Size my points to the appropriate diameter."
>
> Both may be nice, but I think the bin size being determined by the
> point size is more important.  If you set fixed bins, all you have
> done is duplicate a histogram.  The whole point of Wilkinson's paper
> was to provide something closer to the raw data that avoids (to the
> extent possible) arbitrary bin sizes and shifting of points from their
> true values.  Based on the point size, bins are sized to be the bare
> minimum.  Also, bins are not applied to the range of the data---the
> data are iteratively binned to avoid a left to right or right to left
> bias.

Oh, I don't mean that one should set a fixed bin size. Sorry if that wasn't clear. You're right, that would just be a histogram (or a histodot graph). Rather, the maximum bin size would determine the size of the points. In the code I posted, you set the maximum bin size; it would be nice to size the points to match.

-Winston

Winston Chang

Nov 13, 2011, 10:06:32 PM
to Brandon Hurr, Ben Woodcroft, ggplot2
I found a bug in my code for the "centerwhole" stacking. This is what it should be for that part:

    if (stacking=="centerwhole") {
        data <- ddply(data, .(binnum), transform, 
                      stackpos = stackpos - ceiling(length(stackpos)/2))
    }
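
The post does not say what the bug was. One plausible reading is that R's round() rounds exact halves to the nearest even number, so the original expression shifts even-sized stacks inconsistently. The sketch below compares the amount subtracted from stackpos by each version for stacks of 1 to 8 points; old_shift and new_shift are names made up for this comparison.

```r
# Amount subtracted from stackpos in each version of "centerwhole"
old_shift <- function(n) round(n / 2 - 0.5) + 1  # original expression
new_shift <- function(n) ceiling(n / 2)          # corrected expression

n <- 1:8
cmp <- data.frame(n, old = old_shift(n), new = new_shift(n))
cmp

# Stack sizes for which the two versions disagree
cmp$n[cmp$old != cmp$new]
```

With the ceiling() form, every even-sized stack has its extra dot on the same side, while the round() form appears to alternate sides.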