simple error bars with +-1 Stdv in bar chart


waschbaer

Feb 4, 2010, 11:18:52 PM
to ggplot2
Hi @all
I am just figuring out how nice and convenient R is for plotting
graphs, but I came across a problem which I wasn't able to solve even
after four hours of research on the net.

I have a set of nutrient concentrations (NO2_N, NO3_N, NH3_N, orgN),
which are separated into wet and dry season (Season) for each sampling
location (ID) and sampling round (round):

read.csv("http://dl.dropbox.com/u/4236038/nutrients2.csv") -
>nutrients2

#I melted this data and deleted all NAs:
melt_nutrients2 <- melt(nutrients2, id=c("ID", "round", "Season"),
na.rm=TRUE)

#And then I calculated the mean for each nutrient based on seasons and ID:

cast(melt_nutrients2, ID+Season ~ variable, mean) ->meansbyseason

#I then melted the casted file:

melt_meansbyseason<-melt(meansbyseason, id=c("ID", "Season"))

#I then plotted the data into bar charts with facets:

melt_meansbyseason$ID <- factor(melt_meansbyseason$ID)

ggplot(melt_meansbyseason, aes(x=ID, y=value, fill=Season)) +
geom_bar(stat='identity', position='dodge') + facet_grid(variable ~ .,
scales = "free_y") + scale_x_discrete()

#My problem now is that I would like to add error bars on each bar
#representing one standard deviation. In a first attempt I created a
#csv file out of the melt_meansbyseason and added the standard
#deviation for each season and ID manually in a new column called STDV.

#I then tried to create the bar charts with the following code.

read.csv("http://dl.dropbox.com/u/4236038/nutrientsbyseason_stdv.csv")
-> nutrients_stdv

ggplot(nutrients_stdv, aes(x=ID, y=value, fill=Season)) +
geom_bar(stat='identity', position='dodge') +
geom_errorbar(aes(ymin=value-stdv,ymax=value+stdv), width=2) +
facet_grid(variable ~ ., scales = "free_y")

The following error came up:
Error: Non-continuous variable supplied to scale_y_continuous.
In addition: Warning messages:
1: In Ops.factor(value, stdv) : - not meaningful for factors
2: In Ops.factor(value, stdv) : + not meaningful for factors

My questions would be:
1. Is there a more effective way to calculate one standard deviation
from the original data file "nutrients2.csv" in R and then just use
ggplot to plot the bar charts with +-1 standard deviation?
2. If not, how can I use the modified file with the manually calculated
standard deviation, "nutrientsbyseason_stdv.csv", to display the error
bars on my bar charts?
Thanks a lot in advance for your suggestions.
Cheers
Martin

smu

Feb 5, 2010, 4:02:57 AM
to waschbaer, ggplot2
hi waschbaer,

On Thu, Feb 04, 2010 at 08:18:52PM -0800, waschbaer wrote:
> ...


>
> My questions would be:
> 1. Is there a more effective way to calculate one standard deviation
> from the original data file "nutrients2.csv" in R and then just use
> ggplot to plot the bar charts with +-1 standard deviation?

This is the same way I would do it.

> 2. If not, how can I use the modified file with the manually calculated
> standard deviation, "nutrientsbyseason_stdv.csv", to display the error
> bars on my bar charts?

In your nutrients_stdv, stdv is a factor. Simply transform it to a
number and it will work.


regards,
stefan

Xie Chao

Feb 5, 2010, 4:11:17 AM
to waschbaer, ggplot2

I think plyr is easier to use than multiple rounds of melt + cast:

library(reshape)   # for melt()
library(plyr)      # for ddply() and summarize()
df <- melt(nutrients2, id=c("ID", "round", "Season"), na.rm=TRUE)
df <- ddply(df, .(ID, Season, variable),
            summarize, mean = mean(value), sd = sd(value))
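
For the error bars themselves, a minimal sketch (untested against your csv) of how that summary data frame could feed straight into the dodged bar chart with +-1 SD; the mean/sd column names come from the summarize call above, and position_dodge(width = 0.9) keeps each error bar centred on its bar:

# df has one row per ID/Season/variable with columns mean and sd
library(ggplot2)

dodge <- position_dodge(width = 0.9)
ggplot(df, aes(x = factor(ID), y = mean, fill = Season)) +
  geom_bar(stat = "identity", position = dodge) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                position = dodge, width = 0.25) +
  facet_grid(variable ~ ., scales = "free_y")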

> 2. If not, how can I use the modified file with the manually calculated
> standard deviation, "nutrientsbyseason_stdv.csv", to display the error
> bars on my bar charts?
> In advance thanks a lot for your suggestions.
> Cheers
> Martin

waschbaer

Feb 5, 2010, 8:27:50 AM
to ggplot2
Hi Stefan

Thanks, but what is the code to convert it to a number in R? I tried
numeric but it didn't work.

smu

Feb 5, 2010, 8:46:04 AM
to waschbaer, ggplot2

as.numeric(as.character(x))
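
Applied to your data, that would be something like the following small sketch; I am assuming the column is called stdv, as in your geom_errorbar() call, and that value may need the same treatment:

# stdv was read in as a factor, so go via character first;
# as.numeric() on the factor directly would return the level codes,
# not the actual numbers
nutrients_stdv$stdv  <- as.numeric(as.character(nutrients_stdv$stdv))
nutrients_stdv$value <- as.numeric(as.character(nutrients_stdv$value))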

regards,
stefan

Dennis Murphy

Feb 5, 2010, 1:24:43 PM
to waschbaer, ggplot2
Hi:

I saw the plot you generated, and the first thing I asked myself was: you
want to put error bars on top of that??? I don't understand this fetish with
barcharts and error bars to visualize averages. Having said that, it behooves
me to offer a couple of alternatives.

I ended up using Stefan's method of data reduction. I agree with Xie's approach
in general, but it turned out that a couple of ID/Season/variable combinations
were missing and neither ddply nor summarize wanted to play nice, so...
As it turns out, this was a useful backup plan, as we'll see below.

Although it makes for an overly busy plot, I set ID to be a factor rather than
numeric. The ID values have a large gap in the middle and it was visually
distracting to plot points and lines with ID as numeric. Moreover, no context
for why ID should reasonably be considered numeric was given. The tradeoff
comes with the labeling of the horizontal axis, since all levels of a factor are
plotted...

First plot: ID means by season (color coded) for each response:

p <- ggplot(melt_meansbyseason, aes(x = factor(ID), y = value, color = Season))
p + geom_line() + geom_point() + facet_grid(variable ~ ., scales = "free_y") +
    opts(axis.text.x = theme_text(size = 6))

Differences (and non-differences) among locations and seasons are readily visible
from this plot. There are several obvious outliers at location 18 and a couple at
location 39; moreover, there is a trend in means by location for the orgN
response (are the locations spatially ordered somehow??)  and there are
numerous seasonal differences in the 100+ locations for responses orgN
and NO3_N. When blue dominates red, it means there are no real mean
differences between them, since blue is plotted second.

Second plot: use stat_summary(fun.data = "mean_cl_normal") to plot
confidence intervals for each mean. Again, this is done by location and season.
For ggplot to compute the CI, we have to go back to the originally melted data.

q <- ggplot(melt_nutrients2, aes(x = factor(ID), y = value, color = Season))
q + stat_summary(fun.data = "mean_cl_normal", geom = 'pointrange') +
    facet_grid(variable ~ ., scales = "free_y") +
    opts(axis.text.x = theme_text(size = 6))

This plot is also interesting, as it spotlights differences in means and, especially,
variation across locations within a given response.
The effect of variability in the wet season is obvious in the orgN and NO2_N
responses. The downside of the plot is that it's not easy to get a side-by-side
comparison of the two CIs for seasons, but I would suggest that CIs of the
mean difference would be more pertinent if that were the goal, and I'll leave you
to figure that one out. An alternative summary function for stat_summary is mean_sdl.
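
For instance, a +-1 SD version of the pointrange plot might look like the sketch below (mean_sdl wraps Hmisc::smean.sdl, so Hmisc has to be installed; mult defaults to 2 SDs):

q <- ggplot(melt_nutrients2, aes(x = factor(ID), y = value, color = Season))
q + stat_summary(fun.data = "mean_sdl", mult = 1, geom = "pointrange") +
    facet_grid(variable ~ ., scales = "free_y") +
    opts(axis.text.x = theme_text(size = 6))
# mult = 1 gives mean +/- 1 SD rather than the default 2; in later
# ggplot2 releases this is spelled fun.args = list(mult = 1), and
# opts()/theme_text() have become theme()/element_text().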

Frank Harrell's group at Vanderbilt calls bar charts with error bars "dynamite
plots", a deliciously evocative moniker. The following link is worth reading:

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiKoyama/Poster3.pdf

There are many useful ways to plot averages across different groups. An
alternative to this display would be (Cleveland) dot charts, which have the additional
feature of being amenable to reordering by average or some other measure.
In this case, it's basically the coord_flip() of the above graphics, but it's useful
to see how to get one. The advantage, in this case, is that the location IDs are
easier to see.

What I have in mind is, for an individual response, a dot chart that plots the
two season means for each location. The first one below is ordered by location ID,
the second by average response per location:

# Unordered version of a dot chart for orgN response:
orgn <- subset(melt_meansbyseason, variable == 'orgN')
orgn$variable <- NULL
orgn$ID <- factor(orgn$ID)
r <- ggplot(orgn, aes(x = value, y = factor(ID), color = Season))
r + geom_point()

# Ordered version - need to compute average value by ID and then
#   order IDs by average response. I ended up using aggregate because
#   ddply and summarize wouldn't cooperate.

tt <- aggregate(orgn$value, list(ID = orgn$ID), mean)
names(tt) <- c('ID', 'avgval')
orgn2 <- merge(orgn, tt)
orgn2$ID <- with(orgn2, reorder(ID, avgval))

r <- ggplot(orgn2, aes(x = value, y = ID, color = Season))
r + geom_point()

Whether this is the best way to display averages is open to debate, but the
point is that there are better ways to compare groups by average response
than dynamite charts.

HTH,
Dennis




waschbaer

Feb 7, 2010, 3:09:33 AM
to ggplot2
Hi Dennis

First of all I would like to thank you for your extensive comments on
my problem. I really appreciate that.

I agree that the approach I took, bar charts plus error bars on top of
them, doesn't really seem appropriate for the data I have. I also read
Frank Harrell's comment on dynamite plots, as his group calls bar
charts with error bars. Harrell is right when he says that bar charts
cover a lot of space in the graph, yet you can't really see what the
spread of the data is or how many measurements were taken. They
simply can't show the detail of the data set.

All that made me think of using geom_point() for the illustration of
the entire data set (no means).

I added one more piece of information to the original data set
(catchment ID) and separated the groundwater sampling locations from
the surface water sampling locations in order to avoid visual
distortion of the concentrations, as the concentrations in the
groundwater data set were generally higher. In this example I am just
using the surface water sampling sites (sw). My data set
(nutrients_new_sw) now includes: sampling ID ("ID"), sampling round
("round"), Season ("Season"; wet and dry), catchment ("catchment"; I
have two subcatchments), and nutrient concentrations (NO2-N, ...).

I melted the file, kept "ID", "round", "Season", and "catchment" and
removed all NAs. I then re-ordered the sampling IDs from upstream to
downstream location (left to right on the graphs) to reflect spatial
trends in the subcatchments. The last step was the creation of the
plot showing measured concentrations for nutrients for each sampling
location from upstream to downstream separated by catchments (1,2).

In my opinion this graph now shows the distribution of the data and the
spatial and seasonal differences between the sampling locations much
more clearly. What do you think?

Furthermore, I have some outliers at locations 18 and 39 which are
distorting the graph. My question is: is there a way to make the
y-scale for each nutrient not continuous, but instead stretch the lower
ranges to show the variation there and then, with a big step, just show
the outliers at the top of the scale? For example, PO4-P concentrations
are generally very low (usually below 1), but there are a couple of
higher concentrations, with the highest being 2.34. This is the outlier
that is distorting the whole graph for PO4-P.

Thanks a lot for your great help and comments...Martin

Below is the code:

read.csv("http://dl.dropbox.com/u/4236038/new_files/
nutrients_new_sw.csv") -> nutrients_sw

melt_nutrients_sw <- melt(nutrients_sw, id=c("ID", "round", "Season",
"catchment"),
na.rm=TRUE)

melt_nutrients_sw$ID <- factor(melt_nutrients_sw$ID,
levels=c(5,9,10,12,15,14,17,18,19,22,20,1,4,54,50,49,48,43,46,39,38,37,32,33,35,21,23))

p <- ggplot(melt_nutrients_sw, aes(x = ID, y = value, color =
Season))
p + geom_point() + facet_grid(variable~catchment, scales = "free")


James Howison

Feb 7, 2010, 10:34:41 AM
to ggplot2
Yes, I think it shows the distribution pretty well.

You could try adding + scale_y_log10(). I tried that and it gets a little squashed in the number of labels, but have a look at the bottom of

http://had.co.nz/ggplot2/scale_continuous.html

for some other transforms. Of course a log axis spreads out the points at the lower part of the range, which actually obscures how large the range really is unless you make the log scale abundantly clear ...
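
For what it's worth, a quick sketch building on Martin's last plot (assuming the melt_nutrients_sw data frame from his post; scale_y_log10() only behaves if all values are strictly positive, otherwise a milder transform may be the safer choice):

p <- ggplot(melt_nutrients_sw, aes(x = ID, y = value, color = Season))
p + geom_point() +
    facet_grid(variable ~ catchment, scales = "free") +
    scale_y_log10()   # or scale_y_sqrt() for a gentler compression of the high values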

--J
