Controlling bar placement in geom_bar when plotting from two different data frames

Joel Schwartz

Jul 16, 2014, 1:25:02 PM

to ggp...@googlegroups.com

I was trying to answer a question on Stack Overflow and came up with a ggplot behavior that I couldn't explain. I'm hoping someone here will have the answer.

The questioner wanted to plot the following two data frames separately in a single plot using two calls to geom_bar:

x <- data.frame(dat=rep(seq(1,4),3),let=rep("X"))
y <- data.frame(dat=rep(seq(1,4),4),let=rep("y"))

However, the code below results in the bars being plotted right on top of each other:

ggplot(NULL,aes(dat)) +
geom_bar(data=y,fill="red",width=0.1,position = "dodge") +
geom_bar(data=x,fill="blue",width=0.1,position = "dodge")

I thought the following code would position the bars side-by-side on either side of each major tick mark. However, although the bars are separated, they are not symmetric about each major tick mark and the positioning of the bars is different at each tick mark:

ggplot() +
geom_bar(data=y, aes(dat-0.05), fill="red", width=0.2) +
geom_bar(data=x, aes(dat+0.05), fill="blue", width=0.2)

But it turns out that adding + xlim(0,5) results in the symmetric bars I was expecting. On the other hand, a range of other xlim values result in some bars overlapping and some dodged. For example, xlim(-1,5) results in the first two pairs of bars overlapping and the other two dodged, but not symmetric about the tick mark. xlim(0.5,4.5) results in the first and third pairs of bars overlapping, while the second and fourth pairs are dodged symmetrically about the tick mark.

Can anyone explain this behavior? Is there a way to reliably control the bar positions when plotting two separate data frames with separate calls to geom_bar?

Thanks,
Joel

Dennis Murphy

Jul 16, 2014, 2:50:16 PM

to Joel Schwartz, ggplot2

Hi:

On Wed, Jul 16, 2014 at 10:24 AM, Joel Schwartz <jo...@joelschwartz.com> wrote:
> I was trying to answer a question on Stack Overflow and came up with a
> ggplot behavior that I couldn't explain. I'm hoping someone here will have
> the answer.
>
> The questioner wanted to plot the following two data frames separately in a
> single plot using two calls to geom_bar:
>
> x <- data.frame(dat=rep(seq(1,4),3),let=rep("X"))
> y <- data.frame(dat=rep(seq(1,4),4),let=rep("y"))
>
> However, the code below results in the bars being plotted right on top of
> each other:
>
> ggplot(NULL,aes(dat)) +
> geom_bar(data=y,fill="red",width=0.1,position = "dodge") +
> geom_bar(data=x,fill="blue",width=0.1,position = "dodge")

In geom_bar(), position requires a mapped aesthetic to dodge, stack or
fill. You haven't supplied one. The 'solution' is to concatenate the
data frames, create a factor to distinguish them and proceed from
there.

This version dodges, but is ugly because dat is numeric, which
explains why you need to kludge the x-values:

x <- data.frame(dat=rep(seq(1,4),3), let=rep("x"))
y <- data.frame(dat=rep(seq(1,4),4), let=rep("y"))

DF <- rbind(x, y)
DF$let <- factor(DF$let)

ggplot(DF, aes(x = dat, fill = let)) +
geom_bar(position = "dodge")

Another way is to pre-compute the frequency tables for each data set
and set dat to be a factor. This turns out to be more what ggplot()
expects.

x <- data.frame(dat = factor(seq(4)),
                freq = tabulate(rep(seq(1,4),3)), let = "x")
y <- data.frame(dat = factor(seq(4)),
                freq = tabulate(rep(seq(1,4),4)), let = "y")

DF <- rbind(x, y)
DF$let <- factor(DF$let)

ggplot(DF, aes(x = dat, y = freq, fill = let)) +
geom_bar(stat = "identity", position = "dodge")

Since data frames x and y have no way to 'communicate' with one
another, ggplot2's training process doesn't have sufficient
information to know that you want to dodge the two frequency tables.
You need a mapped aesthetic to dodge position, and in order to get
that, you have to combine the two data sets and make some preliminary
adjustments (e.g., redefining let as a factor with two levels) before
passing it to ggplot().
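
As a rough illustration of what the combined data frame gives ggplot2 to work with, the base R sketch below tabulates the counts per value of dat and per level of let. It uses only the data from this thread; the variable name counts is just for illustration.

```r
# Base R only: combine the two data frames and count rows per (dat, let) pair.
x <- data.frame(dat = rep(seq(1, 4), 3), let = "x")
y <- data.frame(dat = rep(seq(1, 4), 4), let = "y")

DF <- rbind(x, y)
DF$let <- factor(DF$let)   # two levels: the grouping that dodging relies on

# One row per value of dat, one column per level of let.
counts <- table(dat = DF$dat, let = DF$let)
print(counts)
```

Each row of the table corresponds to one tick mark and each column to one of the dodged bars at that tick.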

>
> I thought the following code would position the bars side-by-side on either
> side of each major tick mark. However, although the bars are separated, they
> are not symmetric about each major tick mark and the positioning of the bars
> is different at each tick mark:
>
> ggplot() +
> geom_bar(data=y, aes(dat-0.05), fill="red", width=0.2) +
> geom_bar(data=x, aes(dat+0.05), fill="blue", width=0.2)

The manual shift is necessary because (i) dat is numeric and (ii) you're
plotting the two data frames separately.

>
> But it turns out that adding + xlim(0,5) results in the symmetric bars I was
> expecting. On the other hand, a range of other xlim values result in some
> bars overlapping and some dodged. For example, xlim(-1,5) results in the
> first two pairs of bars overlapping and the other two dodged, but not
> symmetric about the tick mark. xlim(0.5,4.5) results in the first and third
> pairs of bars overlapping, while the second and fourth pairs are dodged
> symmetrically about the tick mark.
>
> Can anyone explain this behavior? Is there a way to reliably control the bar
> positions when plotting two separate data frames with separate calls to
> geom_bar?

If you have two data frames with the same variable names and the same
structure, it's easier to combine them as I showed above. If you
insist on keeping the data frames separate, you'll need to make manual
adjustments to make up for the lack of information ggplot2 requires to
do it programmatically.

Dennis
>
> Thanks,
> Joel
>

Joel Schwartz

Jul 16, 2014, 4:19:15 PM

to ggp...@googlegroups.com, Dennis Murphy

Thanks for your reply Dennis.

I would normally combine the data frames and distinguish them with a factor, as you say, but the Stack Overflow question asked for a way to do it without combining the two data frames. I was hoping there was a way to set the bar position manually, but it sounds like there isn't.

Thanks,
Joel

Joel Schwartz

Jul 17, 2014, 9:01:16 AM

to ggp...@googlegroups.com, Dennis Murphy

It turns out there is a way to control the bar placement. Brian Diggs posted this answer on Stack Overflow:

ggplot(mapping=aes(x=dat))+
  geom_bar(data=y, aes(x=dat-0.1), fill="red", binwidth=0.1)+
  geom_bar(data=x, fill="blue", binwidth=0.1)

The key here is that you are shifting the data by the same amount as one binwidth and that binwidth is less than the spacing between groups. The binning is done on the data after shifting, so that affects which bin the data appears in. Also, without setting the binwidth explicitly, how wide the bins are depends on the range of the plot (which is why it varies when xlim was varied and worked "nicely" for round values).
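
The dependence on xlim can be sketched with a little base R arithmetic. This is a simplified model, not ggplot2's actual binning code: it assumes 30 bins across the x range and bin edges at multiples of the binwidth, as described elsewhere in this thread. The helper name edge_at_tick is hypothetical. It asks whether each integer tick sits exactly on a bin edge. If it does, data shifted by -0.05 and +0.05 fall into the two bins on either side of the tick. If it does not, both shifted values can land in the same bin.

```r
# Simplified model (assumption): 30 bins over the x range, with bin edges
# at multiples of the binwidth. Not taken from ggplot2's source.
edge_at_tick <- function(limits, ticks = 1:4, bins = 30) {
  k <- ticks * bins / diff(limits)   # tick position measured in binwidths
  abs(k - round(k)) < 1e-9           # TRUE when the tick is a bin edge
}

edge_at_tick(c(0, 5))       # TRUE TRUE TRUE TRUE
edge_at_tick(c(0.5, 4.5))   # FALSE TRUE FALSE TRUE
```

Under this model, xlim(0, 5) puts an edge on every tick, and xlim(0.5, 4.5) only on the second and fourth, which agrees with the behavior reported for those two cases. Other limits may be handled differently by the real binning code.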

Brian Diggs

Jul 17, 2014, 1:37:27 PM

to eipi10, ggplot2

On 7/16/2014 8:58 AM, eipi10 wrote:
> I was trying to answer a ggplot question on Stack Overflow
> <http://stackoverflow.com/questions/24769934/ggplot-bar-chart-with-two-dataframes?noredirect=1>
> and came up with a result I couldn't explain. I'm hoping someone here will
> have the answer.

I've added an answer to the question there, but I wanted to give more of
a discussion of what is happening here.

> The questioner wanted to plot the following two data frames separately, but
> without having the bars overlap:

>
> x <- data.frame(dat=rep(seq(1,4),3),let=rep("X"))
> y <- data.frame(dat=rep(seq(1,4),4),let=rep("y"))

> He tried the following, but the bars are right on top of each other:

>
> ggplot(NULL,aes(dat))+
> geom_bar(data=y,fill="red",width=0.1,position = "dodge")+
> geom_bar(data=x,fill="blue",width=0.1,position = "dodge")
>

> I thought the following code would solve the problem:
>
> ggplot() +
> geom_bar(data=y, aes(dat - 0.05), fill="red", width=0.2) +
> geom_bar(data=x, aes(dat + 0.05), fill="blue", width=0.2)
>

> Although my code results in the bars being separated, they are not
> positioned symmetrically on either side of each major tick mark, but
> are shifted by different amounts at different tick marks. Can anyone
> explain what's going on and whether there is a way to control bar
> placement relative to each tick mark?

Realize what you are doing here is shifting the data *before* it is
binned. Now, this shift is enough to move it to the next bin over, but
you are not really putting two bins at the same value next to each
other; you are moving the data into different (adjacent) bins. The
visual effect is about the same, though.

What you are seeing with the shifting around the breaks is that the bins
are not what you think (I'm guessing you think they are 1, 2, 3, 4), but
rather 30 bins spread across the range of x values.

Consider your example

ggplot(NULL,aes(dat))+
geom_bar(data=y, fill="red", position = "dodge") +
geom_bar(data=x, fill="blue", position = "dodge")

The bins on this are

xmin xmax
1 0.9 1.0
2 1.0 1.1
3 1.1 1.2
4 1.2 1.3
5 1.3 1.4
6 1.4 1.5
7 1.5 1.6
8 1.6 1.7
9 1.7 1.8
10 1.8 1.9
11 1.9 2.0
12 2.0 2.1
13 2.1 2.2
14 2.2 2.3
15 2.3 2.4
16 2.4 2.5
17 2.5 2.6
18 2.6 2.7
19 2.7 2.8
20 2.8 2.9
21 2.9 3.0
22 3.0 3.1
23 3.1 3.2
24 3.2 3.3
25 3.3 3.4
26 3.4 3.5
27 3.5 3.6
28 3.6 3.7
29 3.7 3.8
30 3.8 3.9
31 3.9 4.0
32 4.0 4.1

The data all fall in the 2nd, 12th, 22nd, and 32nd bins. And the bins
line up nicely on the round values. That is because the data run from 1
to 4, so the range is 3, which divided into 30 bins gives a binwidth of
0.1. And if you put a bin boundary at 0 and then take the relevant ones
(and some padding), you get these breakpoints. But when you shift the
data before binning

ggplot() +
geom_bar(data=y, aes(dat - 0.05), fill="red", width=0.2) +
geom_bar(data=x, aes(dat + 0.05), fill="blue", width=0.2)

the bins are then

xmin xmax
1 0.8266667 0.930000
2 0.9300000 1.033333
3 1.0333333 1.136667
4 1.1366667 1.240000
5 1.2400000 1.343333
6 1.3433333 1.446667
7 1.4466667 1.550000
8 1.5500000 1.653333
9 1.6533333 1.756667
10 1.7566667 1.860000
11 1.8600000 1.963333
12 1.9633333 2.066667
13 2.0666667 2.170000
14 2.1700000 2.273333
15 2.2733333 2.376667
16 2.3766667 2.480000
17 2.4800000 2.583333
18 2.5833333 2.686667
19 2.6866667 2.790000
20 2.7900000 2.893333
21 2.8933333 2.996667
22 2.9966667 3.100000
23 3.1000000 3.203333
24 3.2033333 3.306667
25 3.3066667 3.410000
26 3.4100000 3.513333
27 3.5133333 3.616667
28 3.6166667 3.720000
29 3.7200000 3.823333
30 3.8233333 3.926667
31 3.9266667 4.030000
32 4.0300000 4.133333
33 4.1333333 4.236667

This is consistent. The data go from 0.95 to 4.05, a range of 3.1.
Divide that into 30 bins for a binwidth of 0.1033333. All these
breakpoints are multiples of that binwidth. But that doesn't line them
nicely up on the integers.
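
The arithmetic behind the two tables can be reproduced in base R. This is only a sketch of the rule described above (range divided by 30, breakpoints at multiples of the binwidth), not ggplot2's own code; default_binwidth is a hypothetical helper name.

```r
# Assumption: binwidth = range / 30, breakpoints at multiples of the binwidth.
default_binwidth <- function(values, bins = 30) diff(range(values)) / bins

x <- data.frame(dat = rep(seq(1, 4), 3))
y <- data.frame(dat = rep(seq(1, 4), 4))

bw_plain   <- default_binwidth(c(x$dat, y$dat))                 # 3 / 30
bw_shifted <- default_binwidth(c(y$dat - 0.05, x$dat + 0.05))   # 3.1 / 30

# First three breakpoints of each table, as multiples of the binwidth.
breaks_plain   <- bw_plain * 9:11     # 0.9, 1.0, 1.1
breaks_shifted <- bw_shifted * 8:10   # 0.8266667, 0.93, 1.0333333

print(round(breaks_plain, 7))
print(round(breaks_shifted, 7))
```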

Note that this also relies on the shift you are giving (0.05) being less
than the computed binwidth (0.103333), so that some data is shifted down
(no more than) one bin and the other data is not shifted out of its
original bin.
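
To see why the bars end up at different offsets from each tick, the sketch below assigns the shifted values to bins under the same assumptions (30 bins over the data range, edges at multiples of the binwidth). The names bin_of and edge_offset are made up for the illustration, and the bin indices count multiples of the binwidth, so they are not the row numbers of the table above.

```r
bins <- 30
red  <- (1:4) - 0.05                  # y data, shifted down
blue <- (1:4) + 0.05                  # x data, shifted up
bw   <- (max(blue) - min(red)) / bins # 3.1 / 30

# Index k of the bin [k * bw, (k + 1) * bw) that a value falls into.
bin_of <- function(v) floor(v / bw)

# The edge shared by the red and blue bins is the lower edge of the blue bin.
shared_edge <- bin_of(blue) * bw

result <- data.frame(
  tick        = 1:4,
  red_bin     = bin_of(red),
  blue_bin    = bin_of(blue),
  edge_offset = round(shared_edge - 1:4, 4)
)
print(result)
# edge_offset comes out as 0.0333, -0.0367, -0.0033, 0.0300
```

The red and blue values always land in adjacent bins, but the edge between those bins sits at a different distance from each integer, which is the asymmetry seen in the plot.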

> I realize the "standard" ggplot solution would be to rbind the two
> data frames and use a fill aesthetic to get the two separate bars
> properly dodged (which is discussed in the SO question). However, if,
> for some reason, you want to maintain two separate data frames, is
> there some way to control bar placement when plotting the two data
> frames with separate calls to geom_bar?

The answer I gave at stackoverflow was

ggplot(mapping=aes(x=dat))+
geom_bar(data=y, aes(x=dat-0.1), fill="red", binwidth=0.1)+
geom_bar(data=x, fill="blue", binwidth=0.1)

This makes specific use of the shifting of data into different bins and
is explicit about the binwidth. Because one set of data is shifted by
exactly one binwidth (0.1), the bars are guaranteed to be separated. For
this to work, the binwidth must be smaller than the separation between
unique values and the values need to be multiples of the binwidth. If
the data is not quasi-discrete, this approach will not work, since there
won't be empty bins to shift the other data set(s) into. It will,
however, generalize to more than 2 sets (with the shifts being multiples
of the binwidth), so long as the product of the number of sets and the
binwidth is still less than the separation between values.
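
Those conditions can be checked before plotting. The helper below is a hypothetical sketch in base R (the name shift_trick_ok and its arguments are not from ggplot2): it tests that the unique values are multiples of the binwidth and that the number of sets times the binwidth is smaller than the smallest gap between values.

```r
# Hypothetical helper: does the "shift into empty bins" trick apply?
shift_trick_ok <- function(values, binwidth, n_sets = 2, tol = 1e-8) {
  u <- sort(unique(values))
  # Smallest gap between distinct values; a single value has no gap to check.
  separation <- if (length(u) < 2) Inf else min(diff(u))
  # Values must sit on the binwidth grid (within floating-point tolerance).
  ratio   <- u / binwidth
  on_grid <- all(abs(ratio - round(ratio)) < tol)
  on_grid && (n_sets * binwidth < separation)
}

dat <- rep(seq(1, 4), 3)
shift_trick_ok(dat, binwidth = 0.1)               # TRUE
shift_trick_ok(dat, binwidth = 0.1, n_sets = 10)  # FALSE: 10 * 0.1 is not < 1
shift_trick_ok(dat, binwidth = 0.3)               # FALSE: 1 is not a multiple of 0.3
```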

> Thanks,
> Joel

--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

Brian Diggs

Jul 17, 2014, 2:30:30 PM

to Joel Schwartz, ggplot2

On 7/17/2014 6:00 AM, Joel Schwartz wrote:
> It turns out there is a way to control the bar placement. Brian Diggs
> posted this answer on Stack Overflow:
>
>> ggplot(mapping=aes(x=dat))+
>> geom_bar(data=y, aes(x=dat-0.1), fill="red", binwidth=0.1)+
>> geom_bar(data=x, fill="blue", binwidth=0.1)
>>
>> The key here is that you are shifting the data by the same amount
>> as one binwidth and that binwidth is less than the spacing between
>> groups. The binning is done on the data after shifting, so that
>> affects which bin the data appears in. Also, without setting the
>> binwidth explicitly, how wide the bins are depend on the range of
>> the plot (which is why it varies when xlim was varied and worked
>> "nicely" for round values).

Realize that that approach does not move the (summarized) bars (that is,
it does not "control the bar placement"); rather it moves the data into
otherwise empty bars. The distinction is somewhat subtle, but important
for understanding how it works (and, more importantly, when it will not
work).

I agree with everyone else that the proper way to do this is to
bind-and-dodge, but tackled the question in the spirit of understanding
why the alternative approaches did not work and seeing under what
conditions they could be made to (at least appear to) work.
