how to assign sampling weights to a ggplot2 graph?


igreg

Nov 1, 2010, 4:53:01 PM
to ggplot2

dear listmembers...

initial situation: complex sample. to balance our sample, we
calculated for every case a sample weight.
now i would like to plot the graphs including this weight...

my example (without using the variable "sample.weight")



library(ggplot2)
library(reshape)  # melt() and cast()
library(plyr)     # ddply()

d.data <- data.frame(index = 1:20,
  watch.tv = c(2,3,1,4,2,3,2,1,2,3,2,1,2,3,4,3,2,1,3,2),
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  surf.the.net = c(4,3,4,4,4,4,4,4,4,3,4,3,4,2,1,2,1,2,3,4),
  sample.weight = c(0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,
                    1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1))

### here i don't know how to use the var sample.weight; only columns 2:4 are melted
m.data <- melt(d.data, id = "index", measure.vars = 2:4)

## calculate mean.value (for ordering the vars)
m.data2 <- ddply(m.data, .(variable), transform,
                 mean.value = mean(as.numeric(value)))

### factorizing the var "value"
m.data2$value <- factor(m.data2$value,
                        labels = c("never", "once a month", "once a week", "daily"))

## cast it
d.data2 <- cast(m.data2, variable + value + mean.value ~ ., length)

## rename the cols
colnames(d.data2) <- c("variable", "Legend", "mean.value", "freq")

## calc percent and position
d.data3 <- ddply(d.data2, "variable",
  function(x) {
    x <- x[order(x$Legend), ]
    x$percent <- x$freq / sum(x$freq) * 100
    x$cfreq <- cumsum(x$freq) / sum(x$freq)
    x$pos <- (c(0, x$cfreq[-nrow(x)]) + x$cfreq) / 2
    x
  })

### Plot

pl <- ggplot(data = d.data3,
             aes(x = reorder(variable, mean.value), y = freq, fill = Legend))
pl <- pl + geom_bar(stat = "identity", position = "fill")
pl <- pl + coord_flip()
pl <- pl + scale_y_continuous("", formatter = "percent")
pl <- pl + geom_text(data = d.data3,
                     aes(x = variable, y = pos, label = round(percent, 1)), size = 4)
pl <- pl + scale_x_discrete(name = "")  ### remove the x-axis title
pl




##########

Question: how can i assign the variable "sample.weight" to my frequency
data and then plot the weighted graph?
after hours of thinking and googling, i don't see any solution...
thanks for every hint!

greg


p.s. ggplot2 can unfortunately not handle objects from the package
"survey". i got this message if i try to plot a weighted survey
sample: "ggplot2 doesn't know how to deal with data of class
survey.design2survey.design".
if this would work, my problem would be probably solved by itself.

Dennis Murphy

Nov 1, 2010, 8:04:08 PM
to igreg, ggplot2
Hi:

See inline.


I don't see how you can reasonably expect to do that. You have a stacked bar chart on a percentage scale where the percentages are assigned within levels of variable. How do you propose to plot the (weighted) mean by variable on a numerical scale that retains any sense of comparability with the visually dominant bar chart? It would make more sense to make a separate plot for the means. BTW, you'd have the same problem no matter which graphics engine you tried - i.e., it's not a ggplot2 issue, it's one of graphics design.

Here's one way to get the plots of means by variable and sampling weight:

 ddply(m.data, .(variable), summarise, mv = mean(value))   # unweighted
      variable  mv
1     watch.tv 2.3
2   read.books 2.1
3 surf.the.net 3.2

ddply(wm.data, .(variable), summarise, wmv = weighted.mean(value, sample.weight))  # weighted
      variable  wmv
1     watch.tv 2.30
2   read.books 2.04
3 surf.the.net 3.14             
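
Note that wm.data is never defined in the thread. A plausible reconstruction (an assumption, not necessarily what was actually run) is the long-format data with sample.weight kept as an id column, for example melt(d.data, id = c("index", "sample.weight")). The sketch below builds the same shape in base R, so it runs without reshape or plyr, and it reproduces the weighted means shown above:

```r
# Data from the original post
d.data <- data.frame(
  index = 1:20,
  watch.tv = c(2,3,1,4,2,3,2,1,2,3,2,1,2,3,4,3,2,1,3,2),
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  surf.the.net = c(4,3,4,4,4,4,4,4,4,3,4,3,4,2,1,2,1,2,3,4),
  sample.weight = rep(c(0.9, 1.1), each = 10)
)
vars <- c("watch.tv", "read.books", "surf.the.net")

# Hypothetical reconstruction of wm.data: one row per respondent and variable,
# with the sampling weight carried along on every row
wm.data <- data.frame(
  index = rep(d.data$index, times = length(vars)),
  sample.weight = rep(d.data$sample.weight, times = length(vars)),
  variable = factor(rep(vars, each = nrow(d.data)), levels = vars),
  value = unlist(d.data[vars], use.names = FALSE)
)

# Weighted mean of value within each variable
wmv <- sapply(split(wm.data, wm.data$variable),
              function(x) weighted.mean(x$value, x$sample.weight))
print(round(wmv, 2))
```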

# the weighted and unweighted means are not very different when averaged over sampling weight, but we can see differences when we stratify by sampling weight:

wmSumm <- ddply(wm.data, .(variable, sample.weight), summarise, mv = mean(value))
h <- ggplot(wmSumm, aes(x = sample.weight, y = mv, group = variable, colour = variable))
h + geom_point(size = 2.5) + geom_line(size = 1) + ylab('Mean') +
    scale_x_continuous(breaks = c(0.9, 1.1))

You could also try a bar chart that stratifies sampling weight within variable (think position = 'dodge' for the latter); I'll let you think about that.

> with(d.data, table(sample.weight, watch.tv))
             watch.tv
sample.weight 1 2 3 4
          0.9 2 4 3 1
          1.1 2 4 3 1
> with(d.data, table(sample.weight, read.books))
             read.books
sample.weight 1 2 3 4
          0.9 1 3 4 2
          1.1 7 2 0 1
> with(d.data, table(sample.weight, surf.the.net))
             surf.the.net
sample.weight 1 2 3 4
          0.9 0 0 2 8
          1.1 2 3 2 3


> p.s. ggplot2 can unfortunately not handle objects from the package
> "survey". i got this message if i try to plot a weighted survey
> sample: "ggplot2 doesn't know how to deal with data of class
> survey.design2survey.design".
> if this would work, my problem would be probably solved by itself.

I doubt it; dealing with sampling weights is a nontrivial task. You would likely get that message by using any package specific data object that does not have a ggplot() method written for it. After all, there are over 2500 contributed CRAN packages, many of which define their own object classes.

HTH,
Dennis

--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

igreg

Nov 2, 2010, 11:29:22 AM
to ggplot2

thanks a lot for your help, Dennis...


>
> > Question: how can i assign the variable "sample.weight" to my frequency
> > data and then plot the weighted graph?!?
> > after hours of thinking and googling, i don't see any solution...
> > thanks for every hint!
>
> I don't see how you can reasonably expect to do that. You have a stacked bar
> chart on a percentage scale where the percentages are assigned within levels
> of variable. How do you propose to plot the (weighted) mean by variable on a
> numerical scale that retains any sense of comparability with the visually
> dominant bar chart? It would make more sense to make a separate plot for the
> means. BTW, you'd have the same problem no matter which graphics engine you
> tried - i.e., it's not a ggplot2 issue, it's one of graphics design.

i don't need the weighted means... i need the weighted frequencies
(percentages).

let me explain more clearly, using an example from the data above.

unweighted frequencies of the var "read.books":

factor - freq - percent
1 (never) - 8 - 40%
2 (once a month) - 5 - 25%
3 (once a week) - 4 - 20%
4 (daily) - 3 - 15%

now let's use the variable sample.weight to weight these frequencies...

factor - weighted.freq

1 (never) = ((1*1*0.9)+(7*1*1.1)) / 1 = 8.6
2 (once a month) = ((3*2*0.9)+(2*2*1.1)) / 2 = 4.9
3 (once a week) = ((4*3*0.9)+(0*3*1.1)) / 3 = 3.6
4 (daily) = ((2*4*0.9)+(1*4*1.1)) / 4 = 2.9

the new "weighted.freqs" (8.6 + 4.9 + 3.6 + 2.9 = 20) will be transformed into
"weighted.percents".

the weighted table for the var read.books looks like this:

1 (never) - 8.6 - 43%
2 (once a month) - 4.9 - 24.5%
3 (once a week) - 3.6 - 18%
4 (daily) - 2.9 - 14.5%

and these weighted percent values will be plotted as a stacked bar.
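
The arithmetic above amounts to summing the sampling weights of all cases within each answer category. A minimal base-R check of those numbers, using xtabs() (my choice here, not something posted in the thread):

```r
# read.books answers and sampling weights from the example data
read.books <- c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1)
sample.weight <- rep(c(0.9, 1.1), each = 10)

# Weighted frequency = sum of the weights of all cases in a category
wfreq <- xtabs(sample.weight ~ read.books)

# Weighted percentages within the variable
wpct <- 100 * prop.table(wfreq)

print(wfreq)
print(wpct)
```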




> > p.s. ggplot2 can unfortunately not handle objects from the package
> > "survey". i got this message if i try to plot a weighted survey
> > sample: "ggplot2 doesn't know how to deal with data of class
> > survey.design2survey.design".
> > if this would work, my problem would be probably solved by itself.
>
> I doubt it; dealing with sampling weights is a nontrivial task. You would
> likely get that message by using any package specific data object that does
> not have a ggplot() method written for it. After all, there are over 2500
> contributed CRAN packages, many of which define their own object classes.

indeed. dealing with sampling weights is not trivial. and i see that
ggplot2 cannot understand every object class on CRAN. but
ggplot2 is by far the best graphics package. i've never seen
output of this quality before. i love it.
and survey is the most popular package for analysing complex samples. for
me - and i hope for many others too - it would be fantastic if ggplot2
knew how to handle objects from the survey package...

thanks for any new idea to solve my problem with the sampling weights

greg


Dennis Murphy

Nov 2, 2010, 12:46:30 PM
to igreg, ggplot2
Hi:


This doesn't look especially hard to program, but you want to do the calculations outside of ggplot2. The most painless route is to arrange your data frame to make the graphics code as simple as possible - this would be true in any of R's graphic engines. It doesn't appear difficult to write a function to produce the weighted frequencies as outlined above. Use the function to input the results from the survey package, process the data as necessary and output a data frame, as that is the expected data class for ggplot2 (and lattice, for that matter). Once the data are in proper shape for ggplot2, the plot call is comparatively simple...you've already done it :)

The general advice is to do as much of the work as possible to produce an input data frame that is easy for the graphics engine to input and process. The more you ask of the graphics engine, the deeper you have to dig into its dark corners. The nice part about doing the work to generate a 'nice' data frame is that you have the entire arsenal of R's default and contributed packages available to help in that task. OTOH, each graphics engine has its limitations, some of which you can learn by reading the associated documentation, and some of which you learn the hard way through personal experience or through the shared experiences of others.
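
As a rough illustration of that advice, here is a sketch of such a function in base R. The name weighted_freq and its arguments are hypothetical, and it assumes the answers are coded 1, 2, ... in the same order as the labels. It returns one row per variable and answer category, with the columns the plotting code in this thread expects (freq, percent, pos):

```r
# Hypothetical helper: weighted frequency table in long format, ready for plotting.
# Assumes answers are integer codes 1..length(labels).
weighted_freq <- function(data, vars, weight, labels) {
  out <- lapply(vars, function(v) {
    # factor() with explicit levels keeps categories that nobody chose
    answer <- factor(data[[v]], levels = seq_along(labels), labels = labels)
    # Sum of sampling weights per answer category (empty categories give NA)
    freq <- tapply(data[[weight]], answer, sum)
    freq[is.na(freq)] <- 0
    cfreq <- cumsum(freq) / sum(freq)
    data.frame(
      variable = v,
      Legend = factor(labels, levels = labels),
      freq = as.numeric(freq),
      percent = as.numeric(100 * freq / sum(freq)),
      # Midpoint of each stacked segment, for placing text labels
      pos = as.numeric((c(0, cfreq[-length(cfreq)]) + cfreq) / 2)
    )
  })
  do.call(rbind, out)
}

# Usage with the example data from the original post
d.data <- data.frame(
  index = 1:20,
  watch.tv = c(2,3,1,4,2,3,2,1,2,3,2,1,2,3,4,3,2,1,3,2),
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  surf.the.net = c(4,3,4,4,4,4,4,4,4,3,4,3,4,2,1,2,1,2,3,4),
  sample.weight = rep(c(0.9, 1.1), each = 10)
)
wf <- weighted_freq(d.data,
                    vars = c("watch.tv", "read.books", "surf.the.net"),
                    weight = "sample.weight",
                    labels = c("never", "once a month", "once a week", "daily"))
print(wf)
```

The resulting data frame can be passed to the existing plot call with stat = "identity" and position = "fill".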

HTH,
Dennis





igreg

Nov 2, 2010, 4:54:33 PM
to ggplot2

Dennis! you're great... the whole thing is a problem of my data
arrangement.

after another 2 hours of trial and error (and of course thinking)... i
got this solution for my example:

d.data <- data.frame(index = 1:20,
  watch.tv = c(2,3,1,4,2,3,2,1,2,3,2,1,2,3,4,3,2,1,3,2),
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  surf.the.net = c(4,3,4,4,4,4,4,4,4,3,4,3,4,2,1,2,1,2,3,4),
  sample.weight = c(0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,
                    1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1))

### now i use "sample.weight" as id var
m.data <- melt(d.data, id = "sample.weight", measure.vars = 2:4)
m.data

### compute the weight.freq var
m.data2 <- ddply(m.data, .(variable), transform,
                 weight.freq = as.numeric(value * sample.weight))

### factorizing the var "value"
m.data2$value <- factor(m.data2$value,
                        labels = c("never", "once a month", "once a week", "daily"))

### compute the sum of the weighted frequencies and divide by "value"
### (as.numeric() of the factor gives back the codes 1 to 4)
m.data.w <- ddply(m.data2, .(variable, value), transform,
                  wfreq = sum(weight.freq / as.numeric(value)))
m.data.w

## cast it
d.data2 <- cast(m.data.w, variable + value + wfreq ~ ., length)

## rename the cols (freq = weighted frequency, freq_real = unweighted count)
colnames(d.data2) <- c("variable", "Legend", "freq", "freq_real")

## calc percent and position
d.data3 <- ddply(d.data2, "variable",
  function(x) {
    x <- x[order(x$Legend), ]
    x$percent <- x$freq / sum(x$freq) * 100
    x$cfreq <- cumsum(x$freq) / sum(x$freq)
    x$pos <- (c(0, x$cfreq[-nrow(x)]) + x$cfreq) / 2
    x
  })

### Plot

pl <- ggplot(data = d.data3, aes(x = variable, y = freq, fill = Legend))
pl <- pl + geom_bar(stat = "identity", position = "fill")
pl <- pl + coord_flip()
pl <- pl + scale_y_continuous("", formatter = "percent")
pl <- pl + geom_text(data = d.data3,
                     aes(x = variable, y = pos, label = round(percent, 1)), size = 4)
pl <- pl + scale_x_discrete(name = "")  ### remove the x-axis title
pl

######################

maybe it's not the simplest or best possible code, but for me IT
WORKS.... yeah..
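
For readers on a current ggplot2 release, a shorter route may be the weight aesthetic, which the default counting stat of geom_bar() sums instead of counting rows. The sketch below is an alternative to the code above, not what was run in this thread; it assumes a recent ggplot2 plus the scales package, where labels = scales::percent replaces the old formatter = "percent" argument. The long data frame is built in base R, and the plotting part is skipped if the packages are not installed:

```r
# Example data from the original post
d.data <- data.frame(
  watch.tv = c(2,3,1,4,2,3,2,1,2,3,2,1,2,3,4,3,2,1,3,2),
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  surf.the.net = c(4,3,4,4,4,4,4,4,4,3,4,3,4,2,1,2,1,2,3,4),
  sample.weight = rep(c(0.9, 1.1), each = 10)
)
vars <- c("watch.tv", "read.books", "surf.the.net")
labs <- c("never", "once a month", "once a week", "daily")

# One row per respondent and variable, keeping the sampling weight
long <- data.frame(
  variable = factor(rep(vars, each = nrow(d.data)), levels = vars),
  Legend = factor(unlist(d.data[vars], use.names = FALSE),
                  levels = 1:4, labels = labs),
  sample.weight = rep(d.data$sample.weight, times = length(vars))
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  # weight = sample.weight makes each bar segment the sum of weights;
  # position = "fill" rescales every bar to 100%
  pl <- ggplot(long, aes(x = variable, fill = Legend, weight = sample.weight)) +
    geom_bar(position = "fill") +
    coord_flip() +
    scale_y_continuous("", labels = scales::percent) +
    scale_x_discrete(name = "")
  # print(pl) draws the plot
}
```

This version does not add the percentage labels inside the bars; for those, a pre-computed table with a pos column is still the simpler approach.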

of course i would still appreciate it if someone could write a
function to input the results from the survey package. but for me,
this is too hard ;-)
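
One possible starting point, offered as a sketch rather than a tested recipe for real complex designs: the survey package's svytable() returns a weighted contingency table, and as.data.frame() turns that into an ordinary data frame that ggplot2 accepts. The design below (ids = ~1, i.e. no clustering or strata) is an assumption made only to fit the toy data; a real survey would need its actual design specification. The code is skipped if survey is not installed:

```r
# Example data from the original post (read.books only)
d.data <- data.frame(
  read.books = c(4,3,2,3,4,1,2,3,2,3,4,1,2,1,2,1,1,1,1,1),
  sample.weight = rep(c(0.9, 1.1), each = 10)
)

plot.df <- NULL
if (requireNamespace("survey", quietly = TRUE)) {
  # Assumed design: independent sampling with per-case weights
  des <- survey::svydesign(ids = ~1, weights = ~sample.weight, data = d.data)

  # Weighted frequencies of read.books
  wtab <- survey::svytable(~read.books, design = des)

  # A plain data frame with columns read.books and Freq
  plot.df <- as.data.frame(wtab)
  plot.df$percent <- 100 * plot.df$Freq / sum(plot.df$Freq)
  print(plot.df)
}
```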

best,
greg

Sietse Brouwer

Nov 3, 2010, 12:15:39 PM
to igreg, ggplot2
Hello Greg,

Here's a function (well, a function + a helper function) that should
take a data frame with subjects' responses and weights, and return a
nicely plottable ggplot.

https://gist.github.com/661249

And a possibly more legible and certainly less source()able HTML version:
http://sietse.krikkert.net/sample.html

I'm not familiar with the survey package, so I have no clue whether
this function can easily be reworked to take
survey.design2/survey.design objects. I hope it can!

Cheers, and good luck,

Sietse

igreg

Nov 4, 2010, 3:38:00 AM
to ggplot2

that's cool. thanks a lot for your function, Sietse!

i will try to adapt it to survey design objects...

greg