Missing values in geom_line


Laurie Kell

May 3, 2010, 10:21:51 AM
to ggplot2
I am plotting a data.frame that contains time series, with columns
for several variables of interest, and I am using facet_grid and
groups to summarise it.

In some of the time series there are missing values for some years, and
geom_line connects the years on either side of the gap, e.g.

dat<-data.frame(year=c(1981:1988,1993:2010),y=rnorm(26))
p <-ggplot(dat)
p+geom_line(aes(year,y))

What I really want is for those years to be treated as missing, e.g.

dat<-data.frame(year=c(1981:2010),y=c(rnorm(8),rep(NA,4),rnorm(18)))
p <-ggplot(dat)
p+geom_line(aes(year,y))

I would rather not add all the missing values but use the power of
ggplot to deal with them automatically. I see that geom_line can take
x as discrete values, so I tried

dat<-data.frame(year=factor(c(1981:2010),levels=1981:2010),y=c(rnorm(8),rep(NA,4),rnorm(18)))
p <-ggplot(dat)
p+geom_line(aes(year,y))+scale_x_discrete()

But this doesn't seem to do the trick. Is there an easy way to do what
I want?

Thanks in advance, Laurie

--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

hadley wickham

May 5, 2010, 4:09:43 PM
to Laurie Kell, ggplot2
> I would rather not add all the missing values but use the power of
> ggplot to deal with them automatically. I see that geom_line can take
> x as discrete values, so I tried

It's basically impossible to deal with them automatically - how is
ggplot2 supposed to know what is missing?

It's also pretty easy to add in the missing values:

dat<-data.frame(year=c(1981:1988,1993:2010),y=rnorm(26))
all <- expand.grid(year = 1981:2010)
dat <- merge(dat, all, all.y = TRUE)

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Laurie Kell

May 6, 2010, 2:38:46 AM
to hadley wickham, Laurie Kell, ggplot2
On 05/05/2010 22:09, hadley wickham wrote:
>> I would rather not add all the missing values but use the power of
>> ggplot to deal with them automatically. I see that geom_line can take
>> x as discrete values, so I tried
>>
> It's basically impossible to deal with them automatically - how is
> ggplot2 supposed to know what is missing?
>
> It's also pretty easy to add in the missing values:
>
> dat<-data.frame(year=c(1981:1988,1993:2010),y=rnorm(26))
> all<- expand.grid(year = 1981:2010)
> dat<- merge(dat, all, all.y = TRUE)
>
> Hadley
>
>
Many thanks, my problem was partly laziness. The example I gave was a
simplified one, as my data.frame had many covariates coming from a glm
that wasn't a balanced design. I guess the only way is to preprocess
the data. But am I right in thinking that you can't use factors as the x
variable in geom_line?
I had thought that

dat<-data.frame(x=factor(c(1991:2000,2005:2010),levels=1991:2010),y=rnorm(16))
ggplot(dat)+geom_line(aes(x,y))

would work, as x can be discrete and a factor does imply an order.

Many thanks, Laurie

hadley wickham

May 6, 2010, 10:06:50 AM
to Laurie Kell, ggplot2
> Many thanks, my problem was partly laziness. The example I gave was a
> simplified one, as my data.frame had many covariates coming from a glm that
> wasn't a balanced design.

This should be easy to fix with application of expand.grid and merge.
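
As a minimal sketch of that expand.grid-and-merge approach for data with more than one covariate: the `area` column, the year ranges, and the name `full` below are invented for illustration, not taken from the original data.

```r
# Hypothetical unbalanced data: two areas, each missing a different block of years
set.seed(42)
dat <- rbind(
  data.frame(area = "north", year = c(1981:1988, 1993:2010),
             stringsAsFactors = FALSE),
  data.frame(area = "south", year = c(1981:1995, 2001:2010),
             stringsAsFactors = FALSE)
)
dat$y <- rnorm(nrow(dat))

# Every combination of covariate level and year that should appear on the plot
all <- expand.grid(area = unique(dat$area), year = 1981:2010,
                   stringsAsFactors = FALSE)

# all.y = TRUE keeps every row of `all`; combinations absent from `dat` get y = NA
full <- merge(dat, all, all.y = TRUE)
full <- full[order(full$area, full$year), ]
```

With the NA rows in place, something like `ggplot(full, aes(year, y)) + geom_line() + facet_grid(area ~ .)` should leave a gap in each panel where that area has no data.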

> I guess the only way is to preprocess the data.
> But am I right in thinking that you can't use factors as the x variable in
> geom_line?
> As I had thought that
>
> dat<-data.frame(x=factor(c(1991:2000,2005:2010),levels=1991:2010),y=rnorm(16))
> ggplot(dat)+geom_line(aes(x,y))
>
> would work as x can be discrete and factor does imply an order.

By default, ggplot2 uses all the factors displayed on the plot to
generate groups - i.e. you get one line for every combination of
categorical values. This obviously doesn't work here, so you need to
manually override it:

ggplot(dat)+geom_line(aes(x,y, group = 1))

Hadley



Dennis Murphy

Aug 14, 2012, 9:19:10 AM
to Lingbing, ggp...@googlegroups.com
Hi:

On Tue, Aug 14, 2012 at 1:15 AM, Lingbing <feng...@gmail.com> wrote:
> There is one more thing. If we had an isolated single point, this point would be dropped when drawing with geom_line.
> For instance the code is:
> dat<-data.frame(year=c(1981:1988, 1993:2010), y=rnorm(26))
> all<- expand.grid(year = 1981:2010)
> dat<- merge(dat, all, all.y = TRUE)
> dat[14, 2] <- NA
> ggplot(dat) + geom_line(aes(factor(year), y, group = 1))

Well, yes. Both points on either side of the 1993 point are missing and you're using the line geometry.

> We do have a point in 1993 but it's been missed.
> If we have more isolated single points, this would be an issue.
> Does anyone know how to tackle this problem?

Plot individual points?

# row 13 of dat is the isolated 1993 observation
dat2 <- dat[13, ]
ggplot(dat, aes(x = year, y = y)) + geom_line() +
     geom_point(data = dat2)

I don't really understand what the question is. What do you expect it to do?

Dennis
Please provide a reproducible example: https://github.com/hadley/devtools/wiki/Reproducibility

Jean-Olivier Irisson

Aug 14, 2012, 5:40:00 PM
to Lingbing, ggp...@googlegroups.com
On 2012-Aug-14, at 10:15 , Lingbing <feng...@gmail.com> wrote:
>
> There is one more thing. If we had an isolated single point, this point would be dropped when drawing with geom_line.
> For instance the code is:
> dat<-data.frame(year=c(1981:1988, 1993:2010), y=rnorm(26))
> all<- expand.grid(year = 1981:2010)
> dat<- merge(dat, all, all.y = TRUE)
> dat[14, 2] <- NA
> ggplot(dat) + geom_line(aes(factor(year), y, group = 1))
>
> We do have a point in 1993 but it's been missed.
> If we have more isolated single points, this would be an issue.
> Does anyone know how to tackle this problem?

I usually superpose geom_line and geom_point, with a small point size (so that the points do not show where the line is drawn), to overcome this issue in a first inspection of the data.

Jean-Olivier Irisson
---
Observatoire Océanologique
Station Zoologique, B.P. 28, Chemin du Lazaret
06230 Villefranche-sur-Mer
Tel: +33 04 93 76 38 04
Mob: +33 06 21 05 19 90
http://jo.irisson.com/
Send me large files at: http://jo.irisson.com/upload/

Lingbing

Aug 14, 2012, 8:10:33 PM
to ggp...@googlegroups.com, Lingbing
Hi,

Thanks for your solution. My question is exactly about isolated single points going missing when using geom_line. As you said, "Both points on either side of the 1993 point are missing".
Adding points separately is feasible, but when there are many isolated points, adding them one by one would be a nuisance.
I tried Jean-Olivier Irisson's idea to "superpose geom_line and geom_point with a small point size", which partly overcomes the issue, but not perfectly.

Thanks, I am still working on it.

Lingbing

Aug 14, 2012, 8:14:54 PM
to ggp...@googlegroups.com, Lingbing
Thanks for your solution.
I tried your idea. It partly overcomes the issue, but not perfectly. Sometimes we want the isolated points to be conspicuous, and a small point size works against that.

Jean-Olivier Irisson

Aug 15, 2012, 6:31:30 AM
to Lingbing, ggp...@googlegroups.com
On 2012-Aug-15, at 02:10 , Lingbing <feng...@gmail.com> wrote:
>
> Thanks for your solution. My question is exactly about isolated single points going missing when using geom_line. As you said, "Both points on either side of the 1993 point are missing".
> Adding points separately is feasible, but when there are many isolated points, adding them one by one would be a nuisance.

I don't know a one-line way to do this.

But to build on Dennis's idea, you could detect all such points and plot them in one call to geom_point. Here is a way to do it:

# create data with points surrounded by NAs
set.seed(1)
n <- 100
d <- data.frame(x=1:n, y=runif(n))
i <- sample.int(n, 10)
i <- c(i, i+2)
i <- i[i <= n]  # drop any index that falls past the end of the data
d$y[i] <- NA

# test plot
library("ggplot2")
ggplot(d, aes(x=x, y=y)) + geom_line()
ggplot(d, aes(x=x, y=y)) + geom_line() + geom_point()

# detect points surrounded by NA
nas <- is.na(d$y)
# in that vector a point surrounded by NAs is a FALSE surrounded by two TRUE
# in numeric terms that's a zero surrounded by two ones
# i.e. a value of 2/3 in a moving average with a window of size 3
# NB: I am using a package to compute the moving average but you could easily recode it "by hand" to make this more portable
library("pastecs")
movAvg <- decaverage(as.numeric(nas))
nasAvg <- as.numeric(extract(movAvg, component="filtered"))
# keep only non-missing centres: an NA with exactly one NA neighbour also averages 2/3
isolated <- which(!nas & nasAvg == 2 / 3)
# This would be easy to package in a function called something like isolated(), which would return the indexes of the isolated values given a vector and to use the function afterwards

# now plot the original data with geom_line and only the isolated points with geom_point
ggplot(mapping=aes(x=x, y=y)) + geom_line(data=d) + geom_point(data=d[isolated,])
ggplot(mapping=aes(x=x, y=y)) + geom_line(data=d) + geom_point(data=d[isolated,], size=1, colour="red")
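
For a version without the pastecs dependency, here is a minimal base-R sketch of the isolated() helper suggested in the comments above. The function body is an assumption about how such a helper could look, not code from ggplot2 or pastecs. Unlike the moving-average approach, it treats the two ends of the vector as if they had a missing neighbour, so a lone value at either end also counts as isolated.

```r
# Return the indices of non-missing values whose neighbours on both sides
# are missing (the ends of the vector are treated as missing neighbours).
# `isolated` is a hypothetical helper name, not part of any package.
isolated <- function(y) {
  nas <- is.na(y)
  prev_na <- c(TRUE, nas[-length(nas)])  # is the previous value missing?
  next_na <- c(nas[-1], TRUE)            # is the next value missing?
  which(!nas & prev_na & next_na)
}

# Small check: positions 4 and 10 are the only values with no drawable neighbour
y <- c(1, 2, NA, 3, NA, NA, 4, 5, NA, 6)
isolated(y)
```

It could then be used as `geom_point(data = d[isolated(d$y), ])` on top of the geom_line layer, with whatever point size and colour make the isolated values stand out.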