y axis scale limits with stat_summary and blurry text

Manuel Spínola

Oct 9, 2010, 6:52:25 PM
to ggplot2
Dear list members,

I have 2 questions (data set and plot attached):

This is my code:

p = ggplot(alt, aes(arrecife, altura))

p + stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = "mean_cl_normal", colour = "red", geom = "errorbar", width = 0.25) +
  scale_y_continuous(name = "Altura (cm)") + xlab("Arrecife") + aes(fill = arrecife) +
  scale_fill_grey() +
  scale_x_discrete(labels = c("Borde Coralino Blue Hole", "Arrecife Posterior HMC AR", "Arrecife Frontal HMC AR", "Parches de Arrecife HMC AR", "Arrecife Posterior HMC AC"), breaks = levels(alt$arrecife)) +
  opts(legend.position = "none") + opts(axis.text.x = theme_text(angle = 30, hjust = 1))


1. When using stat_summary, the y-axis limits are too wide (from 0 to 200), I think because of the raw data, but I need something from 0 to 30. If I set the limits with coord_cartesian(ylim = c(0, 30)), I am only zooming, and the y-axis scale doesn't show up very well: the numbers on the axis are lost.

2. The axis label text looks blurry when I set an angle for the text.

Any suggestions on how to fix these 2 problems?
Thank you very much in advance.

Best,

Manuel




--
Manuel Spínola, Ph.D.
Instituto Internacional en Conservación y Manejo de Vida Silvestre
Universidad Nacional
Apartado 1350-3000
Heredia
COSTA RICA
mspi...@una.ac.cr
mspin...@gmail.com
Teléfono: (506) 2277-3598
Fax: (506) 2237-7036
Personal website: Lobito de río
Institutional website: ICOMVIS
altura.txt
altura.png

Dennis Murphy

Oct 10, 2010, 5:30:27 PM
to Manuel Spínola, ggplot2
Hi Manuel:

There are a few problems here, and in the process I believe you've exposed a potential 'gotcha' in ggplot2.

Firstly, this is (sigh) *another* dynamite plot, so I find it necessary to point you to
 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/TatsukiRcode/Poster3.pdf

To add to the poster's contents, you are using each bar to represent an average, which could be done with far less ink by replacing it with a point, either in a different color or an enlarged size (or both). Less ink, no loss of information, no domination of the plot by needless, distracting bars.

1. When using stat_summary, the y-axis limits are too wide (from 0 to 200), I think because of the raw data, but I need something from 0 to 30. If I set the limits with coord_cartesian(ylim = c(0, 30)), I am only zooming, and the y-axis scale doesn't show up very well: the numbers on the axis are lost.

I'm wondering what goes on with mean_cl_normal, as I'll explain below...

2. The axis label text looks blurry when I set an angle for the text.

Why can't you flip the axis so that the labels can be seen without having to risk whiplash? Example below.

Any suggestions on how to fix these 2 problems?
Thank you very much in advance.


After reading in your data, I decided to do groupwise summaries using ddply() in the plyr package, returning the group mean and standard error, and then computing the lower and upper 95% large sample confidence limits (Wald interval). The result is in the object altSumm.
 
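library(plyr)  # provides ddply() and summarise()
# Standard error of the mean
se <- function(x) sd(x)/sqrt(length(x))
altSumm <- ddply(alt, 'arrecife', summarise, mean = mean(altura), se = se(altura))
# Lower and upper 95% large-sample (Wald) confidence limits
altSumm <- transform(altSumm, ll = mean - 1.96 * se, ul = mean + 1.96 * se)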
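
For anyone without plyr, the same groupwise summary can be cross-checked in base R. This is only a sketch: the `toy` data frame is made up and merely borrows the column names from the thread, and `check` is a hypothetical name for the result.

```r
# Hypothetical toy data standing in for alt (same column names as the thread)
toy <- data.frame(
  arrecife = rep(c("a", "b"), each = 4),
  altura   = c(10, 12, 14, 16, 5, 10, 15, 30)
)

# Standard error of the mean, as defined in the thread
se <- function(x) sd(x) / sqrt(length(x))

# Groupwise mean and standard error with tapply()
means <- tapply(toy$altura, toy$arrecife, mean)
ses   <- tapply(toy$altura, toy$arrecife, se)

# Assemble a summary table and add the Wald 95% limits
check <- data.frame(
  arrecife = names(means),
  mean     = as.vector(means),
  se       = as.vector(ses)
)
check <- transform(check, ll = mean - 1.96 * se, ul = mean + 1.96 * se)
check
```

If the table built this way agrees with the ddply() result on the real data, the summary step itself can be ruled out as the source of any discrepancy in a plot.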


p2p below is a simplified version of your plot, sans labels. When I combine it with error bar plots computed from altSumm, there are noticeable differences in the confidence limit widths, but even worse, offset means! I could understand somewhat different SEs, but different means altogether doesn't make sense. The black dots are the group means from your original plot that replaced the bars. With altSumm, I have the opportunity to double check the calculations, and did check them against other functions such as tapply() and aggregate(), getting the same answers. This makes me wonder what's going on with 'mean_cl_normal' in stat_summary()...this hypothesis is investigated below.

In addition, the mean_cl_normal results suggest that one group is significantly higher than the others, but a one-way ANOVA failed to confirm that (p = 0.428). The CIs produced from altSumm appear to be more consistent with the ANOVA results (not shown, but simple to do with aov() ).  This is the code to produce the attached plot:

p2 <- ggplot(alt, aes(arrecife, altura))
# Black points: group means; red bars: mean_cl_normal limits from stat_summary()
p2p <- p2 + stat_summary(fun.y = mean, geom = "point", size = 3) + ylim(0, 30) +
          stat_summary(fun.data = "mean_cl_normal", colour = "red", geom = "errorbar", width=0.25)
# Blue bars: mean +/- 1 SE computed outside ggplot2 in altSumm
p2p + geom_errorbar(data = altSumm, aes(arrecife, mean, ymin = mean - se,
                                        ymax = mean + se), colour = 'blue', width = 0.25)
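
The one-way ANOVA mentioned above is not shown in the thread. A minimal sketch of how it could be run with aov() follows; the `toy` data are invented for illustration, so the resulting p-value has nothing to do with the 0.428 reported for the real data.

```r
# Hypothetical toy data with three groups (not the altura data set)
toy <- data.frame(
  arrecife = factor(rep(c("a", "b", "c"), each = 4)),
  altura   = c(10, 12, 14, 16, 11, 13, 15, 17, 9, 12, 15, 18)
)

# One-way ANOVA of altura by arrecife
fit <- aov(altura ~ arrecife, data = toy)

# The first element of the summary is the ANOVA table
tab <- summary(fit)[[1]]
tab

# p-value for the arrecife term
p_value <- tab[["Pr(>F)"]][1]
p_value
```

With the real data, the call would presumably be `aov(altura ~ arrecife, data = alt)`.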

The graph below is a combination stripchart/error bar plot, where the error bar designates the endpoints of a group's approximate 95% CI for mean altura. Alpha transparency is used to lighten the points, a line connecting the means is added and the plot is flipped so that one can read the group labels easily. This plot shows a couple of things: (i) your data are not normally distributed within group, which makes me wonder about the validity of the confidence intervals for mean altura; (ii) the data contain several distinct outliers - since the mean is known to be sensitive to the presence of outliers, a more resistant estimator of center may be called for, such as the median, from which you could obtain a reasonable interval estimate by bootstrapping.
A more appropriate thing to have done here would have been to compare group distributions with a combination of boxplots and stripcharts, which you could use to justify a measure of center other than the mean. (Groupwise density plots would also be useful.) Transformation is an option, but back transformation of endpoints is always an issue of interpretation...

ggplot(alt, aes(arrecife, altura)) +
   geom_point(position = position_jitter(width = 0.1), alpha = 0.3) +
   geom_errorbar(data = altSumm, aes(arrecife, mean, ymin = mean - 1.96 * se,
          ymax = mean + 1.96 * se), colour = 'red', width = 0.25, size = 1) +
   geom_line(data = altSumm, aes(arrecife, mean, group = 1), colour = 'blue') +
   xlab("Arrecife") + ylab("Altura (cm)") +
   scale_x_discrete(labels=c("Borde Coralino Blue Hole", "Arrecife Posterior HMC AR",
                            "Arrecife Frontal HMC AR", "Parches de Arrecife HMC AR",
                            "Arrecife Posterior HMC AC"), breaks=levels(alt$arrecife)) +
   coord_flip()
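
The bootstrap interval for the median suggested above could be sketched in base R as follows. This uses the simple percentile method, which is only one of several bootstrap intervals and is my assumption rather than anything specified in the thread; the sample `x` is made up, and in practice the resampling would be done separately within each arrecife group.

```r
set.seed(1)  # only to make the sketch repeatable

# Hypothetical skewed toy sample with one outlier (not the altura data)
x <- c(8, 9, 10, 11, 12, 12, 13, 15, 18, 95)

# Resample with replacement and record the median of each resample
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

# Percentile bootstrap 95% interval for the median
ci <- quantile(boot_medians, probs = c(0.025, 0.975))

median(x)
ci
```

Unlike the mean, the median of this sample is barely moved by the single large value, which is the reason for considering it here.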

The plot below is a variation on the error bar plot over the range (0, 30) on the response scale, where the axes are flipped to improve readability of the arrecife labels.

q <- ggplot(altSumm, aes(x = arrecife, y = mean, ymin = ll, ymax = ul))
q + geom_point(size = 3) + geom_errorbar(colour = 'red', width = 0.25) +
    geom_line(aes(group = 1), colour = 'blue') + ylim(0, 30) +
    xlab("") + ylab('Mean Altura') +

    scale_x_discrete(labels=c("Borde Coralino Blue Hole", "Arrecife Posterior HMC AR",
                            "Arrecife Frontal HMC AR", "Parches de Arrecife HMC AR",
                            "Arrecife Posterior HMC AC"), breaks=levels(alt$arrecife)) +
    coord_flip()

The warning message below indicates that before stat_summary() does the calculations of mean and standard error for input into mean_cl_normal, it removes all the altura values larger than 30. This would explain the reduced means and shorter confidence limits in the attached plot using your ggplot() code.
Warning messages:
1: Removed 106 rows containing missing values (stat_summary).
2: Removed 106 rows containing missing values (stat_summary).
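
A minimal sketch of the mechanism, using made-up toy data rather than the altura data set, and assuming a recent ggplot2 release (3.3.0 or later, where the summary function is passed as `fun` instead of `fun.y`). ylim() sets scale limits, so out-of-range values are dropped before stat_summary() computes anything, whereas coord_cartesian() only zooms the view. layer_data() is used here just to read back the values each plot computes.

```r
library(ggplot2)

# Hypothetical toy data: group "a" has one value above the 0-30 window
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(10, 12, 14, 16, 60, 8, 9, 10, 11, 12)
)

base <- ggplot(toy, aes(group, value)) +
  stat_summary(fun = mean, geom = "point", size = 3)

# Scale limits: values outside 0-30 become NA and are removed before the
# mean is computed (a warning about removed rows is expected here)
clipped <- layer_data(base + ylim(0, 30))

# Coordinate limits: the mean uses all rows, only the visible window changes
zoomed <- layer_data(base + coord_cartesian(ylim = c(0, 30)))

clipped$y  # group means after the value 60 has been dropped
zoomed$y   # group means computed from the full data
```

If the plotted summaries should reflect all observations, either zoom with coord_cartesian() or summarise first and plot the summary table, as done above with altSumm.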

Hadley quite reasonably recommends doing as much of the work outside of ggplot2 as possible. By using an external data frame of summary data as a double check, I believe we have identified a potential 'gotcha' when using stat_summary() in a ggplot() and plotting over a restricted range. This doesn't appear to be a sum of layer 'main effects', but rather an interaction.

Checking the means and standard errors under this hypothesis,

# altura <= 30:
> with(subset(alt, altura <= 30), tapply(altura, arrecife, mean))
       a        b        c        d        e
17.79832 12.25385 11.83607 11.74803 10.71739
> with(subset(alt, altura <= 30), tapply(altura, arrecife, se))
        a         b         c         d         e
0.6588371 0.5801642 0.7188748 0.7348834 0.9558682

# all data:
> with(alt, tapply(altura, arrecife, mean))
       a        b        c        d        e
22.46667 18.82000 19.91333 20.21333 16.10000
> with(alt, tapply(altura, arrecife, se))
        a         b         c         d         e
0.9491155 1.9017729 1.8018866 2.1657407 3.9614726

That appears to explain the differences in the attached plot...

Let me end with a provocative question...would this problem have been easier to detect with bars to represent the means?

HTH,
Dennis



manuel1.png

Manuel Spínola

Oct 10, 2010, 7:55:34 PM
to Dennis Murphy, ggplot2
Thank you very much Dennis.  Extremely helpful and enlightening.

Best,

Manuel