Fixing the order of entries in a fill-type barchart


Zack Weinberg

Jun 9, 2016, 6:02:26 PM
to ggplot2
I'm attempting to plot the data in the attached CSV file and I've
gotten as far as

suppressPackageStartupMessages({
    library(ggplot2)
    library(RColorBrewer)
})

lwc <- read.csv("language-wordcounts.csv", header=TRUE, comment.char="#")

lwc$lang <- ordered(lwc$lang,
                    levels=subset(lwc, type=='(all)'&name=='(all)')$lang)

ggplot(subset(lwc, lang!='(all)'), aes(x=name, y=nwords, fill=lang)) +
    geom_bar(stat='identity', position='fill') +
    coord_flip() +
    scale_fill_manual(values=rep(brewer.pal(n=8, name="Dark2"),
                                 length.out=length(levels(lwc$lang))))

I have two problems with this that I don't know how to fix:

1) Because there are 88 levels of "lang" but only 8 colors in the
palette, it is very important that the stacking order of the bars be
consistent from row to row. However, the SQL query that generated the
CSV file did not produce a consistent ordering. (The desired order is
the order of the '(all)' bar.) Forcing lwc$lang to be an ordered
factor in the proper order fixed the *color* order but did not affect
the *stacking* order. (This is most obvious in the "Russia 2014" bar,
where Russian appears before English.) How do I fix the *stacking*
order? (I imagine this is best done by sorting the data frame, but
data shuffling in R is something I only dabble in.)

2) It would be nice to apply a lightness gradient or something to the
repetitions of the Brewer palette. How would I go about that?

Thanks,
zw
language-wordcounts.csv

Brian

Jun 9, 2016, 7:47:35 PM
to Zack Weinberg, ggplot2
Hi Zack,
it looks like you have a beautiful soup there.

Here's a jab.

suppressPackageStartupMessages({
    library(ggplot2)
    library(RColorBrewer)
})

lwc <- read.csv("language-wordcounts.csv", header=TRUE, comment.char="#")
## Why not just
## lwc$lang <- factor(lwc$lang)
d.f <- subset(lwc, lang!='(all)')
tots <- tapply(d.f$nwords, d.f$lang, sum)
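## ## Threshold of:
## ## (convert to character first; assigning a new value to a factor gives NA)
## d.f$lang <- as.character(d.f$lang)
## unpopular <- names(tots[tots < 1e7])
## d.f$lang[d.f$lang %in% unpopular] <- "Other"
## tots <- tapply(d.f$nwords, d.f$lang, sum)
## see http://www.cookbook-r.com/Manipulating_data/Sorting/
## order the factor levels from most to fewest total words
d.f$lang <- factor(d.f$lang, levels = names(tots[order(tots, decreasing = TRUE)]))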

box <- ggplot(d.f, aes(x=name, y=nwords, fill=lang)) +
    geom_bar(stat='identity', position='fill') +
    coord_flip() +
    ## ## or
    ## theme(axis.text.x = element_text(angle=90, hjust=1, vjust=1)) +
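    ## lang is a discrete fill, so a continuous colour scale has no effect here;
    ## interpolate one blue-to-red fill colour per level instead
    scale_fill_manual(values = colorRampPalette(c("blue", "red"))(nlevels(d.f$lang)))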
ggsave("zw.pdf", box, width = 12, height = 6)
browseURL("zw.pdf")
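
On the stacking order itself: as far as I know, ggplot2 releases from around this time (before 2.2.0) stack bars in the order the rows appear in the data, while newer releases stack by factor level. If that is what is happening, sorting the data frame by bar and then by lang should make the stacking consistent. A minimal sketch on a made-up toy data frame (the column names follow the CSV, the values are invented):

```r
# Toy stand-in for the real data; values are invented for illustration.
d.f <- data.frame(
  name   = c("A", "A", "B", "B"),
  lang   = c("ru", "en", "en", "ru"),
  nwords = c(30, 70, 60, 40),
  stringsAsFactors = FALSE
)

# Desired stacking order, expressed as factor levels.
d.f$lang <- factor(d.f$lang, levels = c("en", "ru"))

# Sort rows by bar, then by factor level, so every bar lists
# its languages in the same order.
d.f <- d.f[order(d.f$name, d.f$lang), ]
```

After this, each name has its rows in level order (en before ru), which is the order the bars should stack in under the assumption above.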

Instead of trying to get all 88 on a plot, I suggest you make a category
"Other". See the code comments above. You can then note what was
unpopular in a caption, for example.
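
If you do want to keep all 88, here is one possible way to get a lightness gradient across the repetitions of the Brewer palette: mix a growing fraction of white into each repetition. This is only a sketch; lightened_palette and max_mix are names I made up, not part of ggplot2 or RColorBrewer.

```r
# Hypothetical helper: repeat a base palette, mixing more white into
# each successive repetition so that repeats can be told apart.
lightened_palette <- function(n,
                              base = RColorBrewer::brewer.pal(8, "Dark2"),
                              max_mix = 0.6) {
  reps <- ceiling(n / length(base))
  # Fraction of white per repetition: 0 for the first, max_mix for the last.
  mix <- seq(0, max_mix, length.out = reps)
  rgb_base <- col2rgb(base)  # 3 x length(base) matrix of 0..255 values
  cols <- unlist(lapply(mix, function(m) {
    mixed <- round(rgb_base + (255 - rgb_base) * m)
    rgb(mixed[1, ], mixed[2, ], mixed[3, ], maxColorValue = 255)
  }))
  cols[seq_len(n)]
}

# Possible use with the plot above:
#   scale_fill_manual(values = lightened_palette(nlevels(d.f$lang)))
```

With 88 levels and 8 base colours that gives 11 repetitions, so neighbouring repetitions will still be close in lightness.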

Best
Brian