Problem with geom_bar when missing values

johannes rara

Feb 2, 2010, 1:57:15 PM
to ggp...@googlegroups.com
Hello,

See the example below:

library(ggplot2)
library(plyr)  # provides ddply() and summarise()
m.vec <- c("2009 11", "2009 12", "2010 01")
df <- data.frame(m = rep(m.vec, each = 10), cate = rep(LETTERS[1:3], 10),
                 summ = abs(rnorm(30)))
df <- ddply(df, c("m", "cate"), summarise, summ = sum(summ))
df <- subset(df, !(m == "2009 12" & cate == "B"))
ggplot(df, aes(m, summ)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ cate)

The display in panel B is incorrect: the data contains no observation
for "2009 12" in category B, but instead of leaving a blank there, the
bars for "2009 11" and "2010 01" are spread over the space where the
blank should be. How can I fix this?

-jrara

Brian Diggs

Feb 2, 2010, 5:53:58 PM
to ggplot2

If you replace your subset line with

df[df$m=="2009 12" & df$cate == "B","summ"] <- NA

you should get a plot with the space left in. Not sure if this is the
best way, but it works.

--Brian Diggs

johannes rara

Feb 2, 2010, 11:02:27 PM
to Brian Diggs, ggplot2
Thanks. There are many cases where I get data with missing
observations and nothing indicates that rows are missing (there are
no NA rows). So it would be very nice if ggplot2 could deduce that those
observations are missing. Otherwise I have to somehow fill in those
observations with missing values, which is not the way I would like to
do it (actually I'm not even sure how to do it).

-jrara


James Howison

Feb 2, 2010, 11:32:03 PM
to ggplot2
Can you give an example of what you mean? If there's no indication that the rows are missing, how could ggplot2 (or anything) deduce that they are missing :) An example of the data you read in, and what is actually missing data, would be helpful.

johannes rara

Feb 3, 2010, 7:29:57 AM
to James Howison, ggplot2
Well, I don't think I can explain this any better; see my example. I tried
to give nonsense data that shows the problem I'm facing with the original
data. I think it is pretty odd that ggplot2 produces two bars (in
category B) where there are three x-axis categories ("2009 11" etc.), as
the plot shows. I think ggplot2 should somehow check the
number of factor levels and, when a factor level is absent in one
category, draw no bar for it.

I get the data from a customer in the form of df1 (I can't help it) in
the example below:

library(ggplot2)
library(plyr)  # provides ddply() and summarise()
m.vec <- c("2009 11", "2009 12", "2010 01")
df <- data.frame(m = rep(m.vec, each = 10), cate = rep(LETTERS[1:3], 10),
                 summ = abs(rnorm(30)))
df <- ddply(df, c("m", "cate"), summarise, summ = sum(summ))

df1 <- subset(df, !(m == "2009 12" & cate == "B"))
df1
        m cate     summ
1 2009 11    A 1.284907
2 2009 11    B 3.231534
3 2009 11    C 1.409339
4 2009 12    A 1.322530
6 2009 12    C 1.927723
7 2010 01    A 3.309807
8 2010 01    B 3.473588
9 2010 01    C 4.041790

ggplot(df1, aes(m, summ)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ cate)

johannes rara

Feb 3, 2010, 8:16:54 AM
to James Howison, ggplot2
So for now I have to use this kind of tweak, which is not a good solution:

library(ggplot2)
library(plyr)  # provides ddply() and summarise()
m.vec <- c("2009 11", "2009 12", "2010 01")
df <- data.frame(m = rep(m.vec, each = 10), cate = rep(LETTERS[1:3], 10),
                 summ = abs(rnorm(30)))
df <- ddply(df, c("m", "cate"), summarise, summ = sum(summ))

# Subset from df (not df1, which does not exist yet)
df1 <- subset(df, !(m == "2009 12" & cate == "B"))

# This plot is wrong: the two bars in panel B spread over the empty slot
ggplot(df1, aes(m, summ)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ cate)

# Workaround: find the empty m/cate cell and add it back as an NA row
# (as written, this handles a single missing cell only)
tb <- with(df1, table(m, cate))
tb.r <- rownames(tb)[which(tb == 0, arr.ind = TRUE)[1]]
tb.c <- colnames(tb)[which(tb == 0, arr.ind = TRUE)[2]]

df2 <- rbind(df1, c(tb.r, tb.c, NA))
df2$summ <- as.numeric(df2$summ)  # rbind() turned summ into character

# This plot is right: panel B leaves a gap at "2009 12"
ggplot(df2, aes(m, summ)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~ cate)
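
A more general sketch of the same NA-row workaround, in case more than one combination is missing. It uses only base R (expand.grid() and merge()); the toy values and the names grid and df_filled are made up for illustration and are not from the customer data.

```r
# Toy data shaped like df1: the ("2009 12", "B") row is absent.
df1 <- data.frame(
  m    = c("2009 11", "2009 11", "2009 11", "2009 12", "2009 12",
           "2010 01", "2010 01", "2010 01"),
  cate = c("A", "B", "C", "A", "C", "A", "B", "C"),
  summ = c(1.28, 3.23, 1.41, 1.32, 1.93, 3.31, 3.47, 4.04),
  stringsAsFactors = FALSE
)

# Every m x cate combination that could occur (hypothetical helper name).
grid <- expand.grid(m = unique(df1$m), cate = unique(df1$cate),
                    stringsAsFactors = FALSE)

# Left join: combinations absent from df1 come back with summ = NA.
df_filled <- merge(grid, df1, by = c("m", "cate"), all.x = TRUE)
df_filled <- df_filled[order(df_filled$m, df_filled$cate), ]

# Show which combinations were filled in.
print(df_filled[is.na(df_filled$summ), ])
```

df_filled can then be passed to the same ggplot() call in place of df2.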

-jrara

Dennis Murphy

Feb 3, 2010, 8:52:54 AM
to johannes rara, ggplot2
Hi:

If you do your plot in lattice with barchart(), it will work as you want:

library(lattice)
barchart(summ ~ m | cate, data = df1)

(adjust options to get the display you want)

It's rather evident that you think it's 'obvious' for the display to render
zero counts with missing cells, but from a programming perspective,
it's not that obvious. I worked with someone last month who wanted
to get rid of the categories with missing data in lattice in a multipanel
display, and that person may have thought it was 'obvious' for the cells
with missing data to be removed. Programmers make decisions in
designing software, and those decisions have consequences.
Deepayan made one decision in lattice that suits your needs for this
problem, and Hadley chose another.

Perhaps it's easy to make a change to the ggplot2 code to accommodate
this case, and perhaps not. I can see how modifying the code at this point
might affect the entire faceting system. His 'tweak' may have more
far-reaching consequences than your 'tweak'. Something to consider...

HTH,
Dennis

hadley wickham

Feb 3, 2010, 9:26:15 AM
to johannes rara, James Howison, ggplot2

Try this:

ggplot(df1, aes(m, weight = summ)) + geom_bar() + facet_wrap(~ cate)
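
For readers checking the numbers: with weight = summ, each bar's height is the sum of summ for that m value within the panel instead of a row count. The sketch below reproduces those totals in base R with xtabs(); it only illustrates the arithmetic, not ggplot2's internals, and the toy values and the name totals are assumptions.

```r
# Toy data shaped like df1: the ("2009 12", "B") row is absent.
df1 <- data.frame(
  m    = c("2009 11", "2009 11", "2009 11", "2009 12", "2009 12",
           "2010 01", "2010 01", "2010 01"),
  cate = c("A", "B", "C", "A", "C", "A", "B", "C"),
  summ = c(1.28, 3.23, 1.41, 1.32, 1.93, 3.31, 3.47, 4.04),
  stringsAsFactors = FALSE
)

# Sum of summ for every m x cate cell; a cell with no rows sums to 0.
totals <- xtabs(summ ~ m + cate, data = df1)
print(totals)

# The combination missing from df1 shows up as a zero total.
print(totals["2009 12", "B"])
```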

Hadley

--
http://had.co.nz/

johannes rara

Feb 3, 2010, 9:39:09 AM
to hadley wickham, James Howison, ggplot2
Thanks, you saved my day!!!

-jrara

Mark Connolly

Feb 3, 2010, 9:43:03 AM
to johannes rara, hadley wickham, James Howison, ggplot2
Hadley, a little explanation on the logic of the different behaviors? I
seem to be a bit too dense this morning. Maybe some more tea.

hadley wickham

Feb 3, 2010, 9:50:56 AM
to Mark Connolly, johannes rara, James Howison, ggplot2
Hi Mark,

> Hadley, a little explanation on the logic of the different behaviors?  I
> seem to be a bit too dense this morning.  Maybe some more tea.

To be honest, I'm not completely sure - I think it comes down to how
the widths are calculated. stat_bin knows to make bins a fixed width
(0.9) whereas it isn't set by stat_identity so probably uses 90% of
the resolution of the data.
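
A rough sketch of that arithmetic, assuming the guess above is right and that "resolution" means the smallest gap between the x positions present in a panel. The helper smallest_gap and the integer positions 1, 2, 3 for the three months are assumptions for illustration, not ggplot2 code.

```r
# Integer positions of the discrete x values that occur in each panel.
x_full    <- c(1, 2, 3)  # panels A and C: all three months present
x_panel_b <- c(1, 3)     # panel B: "2009 12" is absent

# Hypothetical helper: smallest distance between adjacent distinct values.
smallest_gap <- function(x) min(diff(sort(unique(x))))

# Bar width as 90% of that gap.
width_full    <- 0.9 * smallest_gap(x_full)     # gap of 1
width_panel_b <- 0.9 * smallest_gap(x_panel_b)  # gap of 2, so twice as wide

print(c(full = width_full, panel_b = width_panel_b))
```

Under this assumption the bars in panel B come out twice as wide as in the other panels, which matches the bars spreading into the empty "2009 12" slot.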

Hadley

--
http://had.co.nz/
