Distribution curves on histograms

599 views
Skip to first unread message

KG

unread,
Aug 11, 2010, 8:42:32 AM8/11/10
to ggplot2
I am trying to superimpose a Poisson distribution curve and another
distribution curve (whose probability function or values I want to
specify) on a histogram. I cannot seem to figure out how to do that in
ggplot or lattice.

I superimposed a Poisson curve this way:

r<-range(y)
xfit<-seq(r[1],r[2], by=1)
pfit<-dpois(xfit,1)
yfit<-pfit*sum(y)

where y is a data vector

Is there a more straight-forward and/or sophisticated way to so this
using ggplot2 or lattice?

Please help!

Harlan Harris

unread,
Aug 11, 2010, 11:25:53 AM8/11/10
to ggplot2
For me, the easiest way would be to set up a data.frame with pre-
summarized columns: X, data, pois, dist2. Then do something like (off
the top of my head):

ggplot(df, aes(X, data)) + geom_bar() + geom_line(mapping=aes(X,pois),
color="blue") +
geom_line(mapping=aes(X, dist2), color="red")

But there are lots of other ways to do it. You could have two separate
data frames, one with unsummarized data that's plotted with
geom_histogram(), and one with just the fit distributions, in which
case you'd add data=df2 to the geom_line() calls. If you put the fit
points for the two distributions in the same df in long format, with a
column called "DistType" that's either "Poisson" or "Dist2", then the
legend will make sense. There's a geom_function() too, which lets you
provide a function that generates the Y values automatically, but I
personally have found that a little less clear to use than just pre-
generating the values to plot.

-Harlan

KG

unread,
Aug 11, 2010, 1:30:15 PM8/11/10
to ggplot2
Thanks Harlan,

I got the short and long data frames. I guess I had understated/
misstated the question the first time. I wanted to superimpose 2
different distributions to two histograms (generated by facet_grid).
Is that possible?

I constructed this example:
value<-c(rpois(500,1.5))
group<-rep(c("A","B"),250)
data<-data.frame(value,group)
r<-range(value)
xfit<-seq(r[1]-0.5,r[2]-0.5, by=1)
xp<-xfit+0.5
pfit1<-dpois(xp,1)
pfit2<-dpois(xp,2)
yfit1<-pfit1*250
yfit2<-pfit2*250
GrpA<-data$value[data$group=="A"]
GrpB<-data$value[data$group=="B"]
breaks<-seq(r[1],r[2]+1, by=1)
A.cut=cut(GrpA,breaks,right=FALSE)
A=data.frame(table(A.cut))
B.cut=cut(GrpB,breaks,right=FALSE)
B=data.frame(table(B.cut))

data.short<-data.frame(xp,xfit,pfit1,yfit1,pfit2,yfit2,A$Freq,B$Freq)

g1<-ggplot(data,aes(value))
g2<-g1+geom_histogram(aes(y=..count..),binwidth=1,position="identity")
+facet_grid(.~group)
g2+geom_line(data=data.short,mapping=aes(xfit,yfit1),color="blue") +
geom_line(data=data.short,mapping=aes(xfit, yfit2), color="red")

Histogram for group A needs to be fitted with a yfit1. Group B needs
fitted with yfit2.

However, if I am using the same df data.short and using geom_line() to
plot the fits, I am getting both fits in both grids.

One way to avoid this would be to plot two histograms separately, but
can different curves be fitted to the two histograms generated by
facet_grid() and same short df at once?

Thanks,
KG
> > Please help!- Hide quoted text -
>
> - Show quoted text -

Dennis Murphy

unread,
Aug 11, 2010, 9:42:31 PM8/11/10
to KG, ggplot2
Hi:

Things work a little better if you convert data.short from 'wide' to 'long' format. Since the objective is to get a single fitted line graph *per group*, you want the group variable to be present in the summary data frame in order to use it as an aesthetic in geom_line().

I reconstituted data.short as follows - notice the inclusion of group:

short <- data.frame(xp = rep(xp, 2), xfit = rep(xfit, 2),
   group = rep(c('A', 'B'), each = 7), pfit = c(pfit1, pfit2),
   yfit = c(yfit1, yfit2), Freq = c(A$Freq, B$Freq))

The plot call is then

g2 + geom_line(data = short, aes(xfit, yfit, colour = group), size = 1.4) +
     scale_colour_manual(values = c('A' = 'red', 'B' = 'blue'))

I made the lines a bit thicker for illustration and added statement to set colors in the legend. If you don't want the legend, then

g2 + geom_line(data = short, aes(xfit, yfit, colour = group), size = 1.4) +
   scale_colour_manual(values = c('A' = 'red', 'B' = 'blue'), legend = FALSE)

works for me.

HTH,
Dennis


--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

KG

unread,
Aug 12, 2010, 6:41:12 AM8/12/10
to ggplot2
thanks Dennis/Harlan
that works for me
> > To unsubscribe: email ggplot2+u...@googlegroups.com<ggplot2%2Bunsu...@googlegroups.com­>
> > More options:http://groups.google.com/group/ggplot2- Hide quoted text -

Dennis Murphy

unread,
Aug 12, 2010, 10:18:18 PM8/12/10
to KG, ggplot2
Hi:

Now that I'm mentally awake (I hope...unlike yesterday), you really should be using a bar chart to display discrete distributions rather than a histogram, since the latter is meant to communicate an underlying continuum on the horizontal axis. The spacing between bars in a bar chart is there for a reason: to remind the reader that the horizontal scale is discrete.

Here's a modified version of your code that seems to work using geom_bar() in place of geom_histogram(). It doesn't require as much external fiddling. I changed group to gp, since ggplot2 has a tendency to act up when using names in your data frame that match names of ggplot2 aesthetics (such as group, colour, fill).

value<-c(rpois(500,1.5))
gp<-rep(c("A","B"),250)
data<-data.frame(value,gp)
r <- range(value)

pfit1<-dpois(xp,1)
pfit2<-dpois(xp,2)
yfit1<-pfit1*250
yfit2<-pfit2*250
# Create data frame to plot fitted distribution
fits <- data.frame(gp = rep(c('A', 'B'), each = length(yfit1)),
                    xp = rep(seq(r[1], r[2], by = 1), 2),
                    yfit = c(yfit1, yfit2))

# Need to make value a factor in order for the bars to center at
# the integer values. Otherwise, the class boundaries are
# at the integer values, which we don't want here.

# Bar charts by group:
g1 <- ggplot(data, aes(x = factor(value)))
g2 <- g1 + geom_bar() + facet_grid( ~ gp)

# Note: In the call below, we use xp + 1 since the numeric codes of the
# factor levels in factor(value) start at 1 rather than 0, and the bar chart
# was set up with x as a factor. OTOH, xp has to be numeric since
# geom_line() requires both x and y to be numeric (obviously).

g2 + geom_line(data = fits, aes(x = xp + 1, y = yfit, color = gp), size = 1.4) +

     scale_colour_manual(values = c('A' = 'red', 'B' = 'blue'))

# No legend:
g2 + geom_line(data = fits, aes(x = xp + 1, y = yfit, color = gp), size = 1.4) +

     scale_colour_manual(values = c('A' = 'red', 'B' = 'blue'), legend = FALSE)

HTH,
Dennis


KG

unread,
Aug 13, 2010, 3:52:51 AM8/13/10
to ggplot2
Hi

Earlier I was using color="white", fill="black" in geom_histogram to
make the histogram "appear" like bars with discontinuity, but geom_bar
is more authentic. I needed to assign xp = rep(seq(r[1], r[2], by =
1), 2) before pfit1 and pfit2 (which are using xp) in the code above.
However this is important only when creating the example data.

The geom_bar() is now incorporated in my code and things are looking
great (at least for the time being!). Thanks Dennis.

KG

On Aug 13, 3:18 am, Dennis Murphy <djmu...@gmail.com> wrote:
> Hi:
>
> > <ggplot2%2Bunsu...@googlegroups.com<ggplot2%252Bunsubscribe@googlegroup­s.com>
> > ­>
> > > > More options:http://groups.google.com/group/ggplot2-Hide quoted text
Reply all
Reply to author
Forward
0 new messages