barchart with proportion or percent rather than count


Jacob Wegelin

May 2, 2011, 9:44:11 PM
to ggp...@googlegroups.com

Suppose we have a factor (a nominal variable) and we want to plot its distribution with a barplot. The code for plotting the counts (or frequencies) is straightforward in ggplot2. But how does one relabel the quantitative axis (the "continuous scale") for proportions or percents? That is, how does one make a barplot of proportions rather than counts?

In this email I show one way. But it requires the user to hard-code two things:

(1) a formula for the percent or proportion

(2) a non-default name for the y axis.

Is there a simpler or more elegant (and less error-prone) approach?

set.seed(1)
NN <- 12
categories <- c("dog", "flea", "human", "rat", NA)
DAAT <- data.frame(
  species = factor(sample(categories, prob = c(1.5, 3, 1, 8, 1), size = NN, replace = TRUE), levels = categories)
)
print(DAAT)
# Order the bars by frequency:
DAAT$species <- factor(DAAT$species, levels = rev(names(sort(table(DAAT$species)))))
print(summary(DAAT))
library(ggplot2)
# A barplot of frequencies or counts is straightforward:
print(
  ggplot(data = DAAT, aes(x = species)) + geom_bar()
)
# Here is a hard-coded solution to put proportion on the y axis.
print(
  ggplot(data = DAAT, aes(x = species, weight = 1/nrow(DAAT))) + geom_bar()
  # Without the following, the y axis will be incorrectly labeled "count".
  + scale_y_continuous(name = "Proportion")
)

Thanks for any comments

Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032

Scott Chamberlain

May 3, 2011, 8:12:10 AM
to Jacob Wegelin, ggp...@googlegroups.com
Numbers seem slightly different from your final solution at the bottom, but this is close:

qplot(x=species, y=..density.., data=DAAT, geom="histogram", group=1) + ylab("Proportion")

Replacing histogram with bar would give the same result.

Scott
--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

Jacob Wegelin

May 3, 2011, 11:44:46 AM
to Scott Chamberlain, ggp...@googlegroups.com

Thank you for this solution.

Your numbers are different because I divided by the number of rows, whereas with your syntax, ggplot2 does not count the NA row(s) in the sample size. The following produces a plot identical to yours:

set.seed(1)
NN <- 12
categories <- c("dog", "flea", "human", "rat", NA)
DAAT <- data.frame(
  species = factor(sample(categories, prob = c(1.5, 3, 1, 8, 1), size = NN, replace = TRUE), levels = categories)
)

# Order the bars by frequency:
DAAT$species <- factor(DAAT$species, levels = rev(names(sort(table(DAAT$species)))))
print(
  ggplot(data = DAAT, aes(x = species, weight = 1/sum(!is.na(DAAT$species))))
  + geom_bar()
  + scale_y_continuous(name = "Proportion")
)
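
The difference between the two denominators can be checked without plotting. What follows is only a minimal sketch in base R; the vector `species` is a small hand-made example (an assumption for illustration, not the sampled data above), chosen so that exactly one of twelve rows is NA.

```r
# Hypothetical example data: 12 observations, one of them missing.
species <- factor(c("rat", "rat", "rat", "rat", "rat", "rat",
                    "flea", "flea", "flea", "dog", "human", NA))

counts <- table(species)  # table() drops NA by default

# Denominator 1: all rows, including the NA row (as in weight = 1/nrow(DAAT)).
prop_all_rows <- counts / length(species)

# Denominator 2: non-missing rows only (as in weight = 1/sum(!is.na(...))).
prop_non_missing <- counts / sum(!is.na(species))

print(round(prop_all_rows, 3))
print(round(prop_non_missing, 3))

# The first set of proportions sums to 11/12, the second to 1.
print(sum(prop_all_rows))
print(sum(prop_non_missing))
```

Which denominator is appropriate depends on whether the missing rows should count toward the sample size; the plot alone does not show which one was used.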

Your syntax translated into the intermediate syntax (halfway between the black box of qplot and the explicit syntax of layer), with bar (as you suggested) instead of histogram, is:

ggplot(data=DAAT, aes(x=species, y=..density.., group=1)) + geom_bar() + scale_y_continuous(name="Proportion")

and the full layer syntax is:

ggplot() + layer(data=DAAT, mapping=aes(x=species, y=..density.., group=1), geom="bar") + scale_y_continuous(name="Proportion")

The crux is the mysterious "group" term. Without that term, all bars are of equal height. The help page online for geom_bar says:

“Layers are divided into groups by the group aesthetic. By default this is set to the interaction of all categorical variables present in the plot.”

Has anyone found an explanation of what it means for layers to be divided into groups in this context?

Jake


Hadley Wickham

May 16, 2011, 11:00:37 AM
to Jacob Wegelin, Scott Chamberlain, ggp...@googlegroups.com
> Has anyone found an explanation of what it means for layers to be divided
> into groups in this context?

Each group receives its own geom: one bar per group, one line per group, etc.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Sophia Mayne-Deluca

Jan 27, 2017, 2:42:56 PM
to ggplot2, jacobw...@fastmail.fm, myrmec...@gmail.com, had...@rice.edu
Why is it, then, that the number you put down for group seems irrelevant so long as you put down a number? For example,
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
produces the exact same graph as
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 26))
I know this is a super old post, but it is exactly the question I have been searching the internet for an answer to.
Thank you

Brandon Hurr

Jan 27, 2017, 2:59:10 PM
to Sophia Mayne-Deluca, ggplot2, Jacob Wegelin, Scott Chamberlain, Hadley Wickham
Think of the value passed to group as a label (like a factor level) rather than a number whose magnitude matters.
This code also produces the same plot:
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = "a"))

If you supply a single constant group, the proportions are calculated across the classes of cut. If you supply none, they are calculated within each cut of diamonds, which is meaningless. See the explanation in the geom_count documentation.

You can see it clearly in the data portion of the ggplot build output:
ggplot_build(ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop..)))
ggplot_build(ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1)))
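
The effect of group on the proportion can also be imitated in base R. This is only a rough sketch, assuming that each bar's prop is its count divided by the total count of the group it belongs to (my reading of how the stat behaves, not a quote from its source). The data frame `bars` and the helper `prop_by_group` are hypothetical names made up for the illustration.

```r
# Hypothetical counts, one row per bar (per x value).
bars <- data.frame(x = c("Fair", "Good", "Ideal"),
                   count = c(2, 3, 5))

# Proportion of each bar within its group: count / sum(count) per group.
prop_by_group <- function(count, group) {
  count / ave(count, group, FUN = sum)
}

# No explicit group: each x value forms its own group, so every prop is 1.
bars$prop_default <- prop_by_group(bars$count, bars$x)

# One constant group (1, 26, or "a" would all behave the same):
# the props are taken across all bars and sum to 1.
bars$prop_one_group <- prop_by_group(bars$count, rep(1, nrow(bars)))

print(bars)
```

Under that assumption, the constant passed to group only decides which rows are pooled together; its value never enters the arithmetic.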




Sophia Mayne-Deluca

Jan 27, 2017, 3:13:28 PM
to ggplot2, smayne...@gmail.com, jacobw...@fastmail.fm, myrmec...@gmail.com, had...@rice.edu
Thank you for your detailed answer. 
It makes a lot more sense that setting the group to an explicit number is the same as naming a group that number rather than saying that there are that many groups in total.
Thanks again.
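
For readers on a recent ggplot2, the same plot can be written with after_stat(), which as far as I know is available from version 3.3.0 onward and replaces the ..prop.. notation used in this thread. A minimal sketch, assuming ggplot2 (and the scales package it depends on) is installed; the object names `p` and `built` are just for illustration.

```r
library(ggplot2)

# Bar chart of proportions. group = 1 puts all bars into a single group,
# so prop is computed across the levels of cut rather than within each one.
p <- ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar() +
  # Label the axis as percent instead of the default name.
  scale_y_continuous(name = "Percent", labels = scales::percent)

# Inspect the computed values instead of reading them off the plot.
built <- layer_data(p)
print(built[, c("x", "count", "prop")])
```

Printing `p` draws the chart; the `prop` column of `built` should sum to 1 when a single group is used.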