Data Analysts Captivated by Power of R
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
January 7, 2009
Data Analysts Captivated by R’s Power
By ASHLEE VANCE
To some people R is just the 18th letter of the alphabet. To others, it’s the rating on racy movies, a measure of an attic’s insulation or what pirates in movies say.
R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca partly because data mining has entered a golden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use it.
But R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use.
“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”
It is also free. R is an open-source program, and its popularity reflects a shift in the type of software used inside corporations. Open-source software is free for anyone to use and modify. I.B.M., Hewlett-Packard and Dell make billions of dollars a year selling servers that run the open-source Linux operating system, which competes with Windows from Microsoft. Most Web sites are displayed using an open-source application called Apache, and companies increasingly rely on the open-source MySQL database to store their critical information. Many people view the end results of all this technology via the Firefox Web browser, also open-source software.
R is similar to other programming languages, like C, Java and Perl, in that it helps people perform a wide variety of computing tasks by giving them access to various commands. For statisticians, however, R is particularly useful because it contains a number of built-in mechanisms for organizing data, running calculations on the information and creating graphical representations of data sets.
Some people familiar with R describe it as a supercharged version of Microsoft’s Excel spreadsheet software that can help illuminate data trends more clearly than is possible by entering information into rows and columns.
What makes R so useful — and helps explain its quick acceptance — is that statisticians, engineers and scientists can improve the software’s code or write variations for specific tasks. Packages written for R add advanced algorithms, colored and textured graphs and mining techniques to dig deeper into databases.
Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially. One package, called BiodiversityR, offers a graphical interface aimed at making calculations of environmental trends easier.
Another package, called Emu, analyzes speech patterns, while GenABEL is used to study the human genome.
The financial services community has demonstrated a particular affinity for R; dozens of packages exist for derivatives analysis alone.
“The great beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, chief economist at Google. “And you have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants.”
R first appeared in 1996, when the statistics professors Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand released the code as a free software package.
According to them, the notion of devising something like R sprang up during a hallway conversation. They both wanted technology better suited for their statistics students, who needed to analyze data and produce graphical models of the information. Most comparable software had been designed by computer scientists and proved hard to use.
Lacking deep computer science training, the professors considered their coding efforts more of an academic game than anything else. Nonetheless, starting in about 1991, they worked on R full time. “We were pretty much inseparable for five or six years,” Mr. Gentleman said. “One person would do the typing and one person would do the thinking.”
Some statisticians who took an early look at the software considered it rough around the edges. But despite its shortcomings, R immediately gained a following with people who saw the possibilities in customizing the free software.
John M. Chambers, a former Bell Labs researcher who is now a consulting professor of statistics at Stanford University, was an early champion. At Bell Labs, Mr. Chambers had helped develop S, another statistics software project, which was meant to give researchers of all stripes an accessible data analysis tool. It was, however, not an open-source project.
The software failed to generate broad interest and ultimately the rights to S ended up in the hands of Tibco Software. Now R is surpassing what Mr. Chambers had imagined possible with S.
“The diversity and excitement around what all of these people are doing is great,” Mr. Chambers said.
While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that close to 250,000 people work with it regularly. The popularity of R at universities could threaten SAS Institute, the privately held business software company that specializes in data analysis software. SAS, with more than $2 billion in annual revenue, has been the preferred tool of scholars and corporate managers.
“R has really become the second language for people coming out of grad school now, and there’s an amazing amount of code being written for it,” said Max Kuhn, associate director of nonclinical statistics at Pfizer. “You can look on the SAS message boards and see there is a proportional downturn in traffic.”
SAS says it has noticed R’s rising popularity at universities, despite educational discounts on its own software, but it dismisses the technology as being of interest to a limited set of people working on very hard tasks.
“I think it addresses a niche market for high-end data analysts that want free, readily available code,” said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
But while SAS plays down R’s corporate appeal, companies like Google and Pfizer say they use the software for just about anything they can. Google, for example, taps R for help understanding trends in ad pricing and for illuminating patterns in the search data it collects. Pfizer has created customized packages for R to let its scientists manipulate their own data during nonclinical drug studies rather than send the information off to a statistician.
The co-creators of R express satisfaction that such companies profit from the fruits of their labor and that of hundreds of volunteers.
Mr. Ihaka continues to teach statistics at the University of Auckland and wants to create more advanced software. Mr. Gentleman is applying R-based software, called Bioconductor, in work he is doing on computational biology at the Fred Hutchinson Cancer Research Center in Seattle.
“R is a real demonstration of the power of collaboration, and I don’t think you could construct something like this any other way,” Mr. Ihaka said. “We could have chosen to be commercial, and we would have sold five copies of the software.”
Copyright 2009 The New York Times Company
______________________________________________
Continued high gratitude to all of R-core and the R community for its
unique accomplishments. Every bit of praise is well-earned and
deserved.
I have consistently told colleagues (primarily in the pharma industry)
for the past 8 years or so that R is the most exciting thing going on in
the area of statistics.
Thanks,
Bill
####################
Bill Pikounis
Statistician
On Wed, Jan 7, 2009 at 08:10, Zaslavsky, Alan M.
<zasl...@hcp.med.harvard.edu> wrote:
> This article is accompanied by nice pictures of Robert and Ross.
>
> Data Analysts Captivated by Power of R
> http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
>
> January 7, 2009
> Data Analysts Captivated by R's Power
> By ASHLEE VANCE
>
Amen to that, and in addition, R is now the top tool for everyday
analysis, not just a research statistician's tool.
Frank
--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics,
School of Medicine, Vanderbilt University
______________________________________________
It probably doesn't get said enough, and I am sure I speak for all young
researchers: I am very much indebted to all the kind souls who have helped
me and other newbies on this forum over the years.
Thanks very much, R team.
Thanks for posting. Does anyone else find the statement by SAS to be
humorous yet arrogant and short-sighted?
Kevin
--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin....@utoronto.ca Tel: 416.864.5776 Fax: 416.864.6057
It is an ignorant comment by a marketing person who has been spoon fed
her lines...it is also a comment being made from a very defensive and
insecure posture.
Congrats to R Core and the R Community. This is yet another sign of R's
growth and maturity.
Regards,
Marc Schwartz
I am curious, is there an archive of 'R in the Media' or 'R in the
Press' articles somewhere? It would be interesting to see how the
perception of R has changed/evolved over time relative to other
packages.
Cheers,
Tony Breyal
The SAS spokesperson quoted in the article is clearly whistling past the graveyard.
--
Jeff
Unfortunately, that type of FUD issued by the SAS marketing person still
works. I see it at my employer (a large healthcare company.) It's a
battle to change a culture, but ironically the recession helps.
People are now taking notice of the obscene licensing fees for SAS.
Darin
To me it just seemed like a "blast from the past".
Duncan Murdoch
You mean Tibco...
The statement that S "failed to generate broad interest" is also a bit
misleading. I believe S-PLUS had more than 100,000 users in its day,
although it may be true that its success was mainly in the academic
world. Obviously the pool of people who knew S from the preceding decade
was very important for the early development of R.
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dal...@biostat.ku.dk) FAX: (+45) 35327907
I think that I actually made this statement about both the SAS and
Splus traffic...
I wasn't really trying to be critical of SAS. I was trying to get
across that SAS focused their resources on features that had nothing
to do with *statistical analysis* (e.g. data warehousing etc.)
--
Max
That's a great idea, and I just created an "Rmedia" category on the
REvolutions R blog to track exactly such articles. You can find it
here:
http://blog.revolution-computing.com/rmedia/
If anyone knows of any other mainstream articles about R available
online please let me know, and I'll do a round-up post in that section
to make sure they're captured.
By the way, we're writing about R and issues related to R daily at:
http://blog.revolution-computing.com
--
David M Smith <da...@revolution-computing.com>
Director of Community, REvolution Computing www.revolution-computing.com
Tel: +1 (206) 577-4778 x3203 (Seattle, USA)
Would be so that you can sue them later when a software problem in the
designing of the engine makes your plane fall out of the sky!
Bryan
*************
Bryan Hanson
Professor of Chemistry & Biochemistry
DePauw University, Greencastle IN USA
>> “I think it addresses a niche market for high-end data analysts that
>> want free, readily available code,” said Anne H. Milley, director of
>> technology product marketing at SAS. She adds, “We have customers who
>> build engines for aircraft. I am happy they are not using freeware
>> when I get on a jet.”
>>
>
> Thanks for posting. Does anyone else find the statement by SAS to be
> humorous yet arrogant and short-sighted?
>
> Kevin
______________________________________________
The author of the article, to his credit, was pretty consistent in using
open source terminology.
Regards,
Marc
on 01/07/2009 10:26 AM Bryan Hanson wrote:
> I believe the SAS person shot themselves in the foot in more ways than
> one. In my mind, the reason you would pay, as Frank said, for
>
>> non-peer-reviewed software with hidden implementations of analytic
>> methods that cannot be reproduced by others
>
> Would be so that you can sue them later when a software problem in the
> designing of the engine makes your plane fall out of the sky!
______________________________________________
I agree. I work for a consulting firm (human services) and my boss
prefers us to use SPSS, rather than R. It's painful. I have version 11
installed on my Windows laptop. Next year, the license expires!
For someone coming from an SPSS background, R is a little mind-blowing,
simply because it is so much more powerful. But perseverance pays off.
Once I master Sweave and such, I'll be able to churn out reports much
more quickly than I ever could with SPSS.
I do wish the author of the article had included comments from SPSS, in
addition to the humorous FUD from the SAS spokesperson. Newer versions
of SPSS actually have the option of using R for data analysis, in
addition to the SPSS engine. It would have been interesting to compare
the corporate responses of the two companies.
--
Insert something humorous here. :-)
"I hope that they run SAS on Solaris too, god only knows how tainted the
syscalls are in that linux freeware."
Of course, now Solaris is 'freeware', too, so I suppose that according to
SAS, running SAS on Windows is the best way to be sure you're getting the
right answers.
regards,
ajay
I'm not so sure about that. Since the article described R as
"a supercharged version of Microsoft's Excel", surely people
should run R on Windows and be *ab*so*lute*ly* sure of getting
the right answers (and supercharged to boot)????
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Jan-09 Time: 18:30:39
------------------------------ XFMail ------------------------------
> Unfortunately, that type of FUD issued by the SAS marketing person still
> works. I see it at my employer (a large healthcare company.)
I see it here, at a university. Quote: "We couldn't possibly do our
analysis using some software we've just downloaded from a web site"
*facepalm*
> It's a
> battle to change a culture, but ironically the recession helps.
> People are now taking notice of the obscene licensing fees for SAS.
They'll just keep increasing their educational discount, or as we
say, "the first hit is free"...
BaRRy
Though, thinking about it, I suppose if one could work out the 'best'
keywords to use, it might be possible to not get too many misclassified
results, e.g.,
http://news.google.com/news?hl=en&ned=us&nolr=1&q=r+open+source+programming+language&btnG=Search
or something like that. Will be keeping an eye on David's page from
time to time though, just in case he catches anything :-)
Lovely to see R getting the attention it so rightly deserves.
On 7 Jan, 18:29, "Ajay ohri" <ohri2...@gmail.com> wrote:
> you can use google alerts to track media coverage of R using some keywords
>
> regards,
>
> ajay
>
there must be something wrong with me, but i can't find anything
'humorous yet arrogant and short-sighted' in the idea that engines for
aircraft be built with software that does not advertise itself with
'ABSOLUTELY NO WARRANTY.'
vQ
Presuming that the Google Groups archive of SAS-L is reasonably complete:
http://groups.google.com/group/comp.soft-sys.sas/about
The monthly posting frequency data since 1993 is:
Posts <- structure(list(Jan = c(NA, 546L, 548L, 853L, 1007L, 894L, 514L,
1720L, 1826L, 1941L, 1832L, 1636L, 2122L, 2722L, 2750L, 2305L,
357L), Feb = c(NA, 511L, 734L, 1024L, 1150L, 1068L, 493L, 1519L,
1537L, 1845L, 1846L, 1652L, 1960L, 1645L, 926L, 2255L, NA), Mar = c(NA,
658L, 963L, 805L, 1108L, 945L, 659L, 1177L, 1915L, 2010L, 1755L,
2188L, 629L, 1711L, 1728L, 2712L, NA), Apr = c(NA, 681L, 792L,
1052L, 1315L, 784L, 1077L, 1163L, 1467L, 2199L, 1757L, 1826L,
2169L, 2796L, 2766L, 2789L, NA), May = c(NA, 712L, 945L, 1163L,
1212L, 448L, 778L, 1963L, 1735L, 2373L, 1863L, 1836L, 2283L,
3147L, 2974L, 2025L, NA), Jun = c(NA, 751L, 1002L, 999L, 1127L,
813L, 540L, 1615L, 1905L, 2133L, 1701L, 2606L, 2407L, 2723L,
2691L, 2368L, NA), Jul = c(15L, 763L, 775L, 1184L, 1074L, 896L,
476L, 1572L, 2027L, 2445L, 1926L, 1843L, 2061L, 761L, 2435L,
2607L, NA), Aug = c(458L, 975L, 969L, 1053L, 692L, 823L, 612L,
1696L, 1976L, 1492L, 1689L, 2143L, 1793L, 2027L, 2592L, 2584L,
NA), Sep = c(330L, 703L, 745L, 1176L, 947L, 894L, 1351L, 1491L,
1439L, 1864L, 1646L, 1784L, 1365L, 2714L, 1868L, 2554L, NA),
Oct = c(219L, 805L, 691L, 1197L, 900L, 1129L, 1708L, 1669L,
1592L, 2133L, 1832L, 1712L, 1427L, 2983L, 2320L, 2434L, NA
), Nov = c(472L, 752L, 773L, 911L, 853L, 733L, 1720L, 1490L,
1636L, 1663L, 1545L, 1786L, 1518L, 2848L, 2112L, 1984L, NA
), Dec = c(517L, 666L, 765L, 844L, 677L, 492L, 1595L, 1298L,
1424L, 1520L, 1445L, 2148L, 1524L, 2374L, 1948L, 1921L, NA
)), .Names = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c("1993",
"1994", "1995", "1996", "1997", "1998", "1999", "2000", "2001",
"2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009"
))
> Posts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1993 NA NA NA NA NA NA 15 458 330 219 472 517
1994 546 511 658 681 712 751 763 975 703 805 752 666
1995 548 734 963 792 945 1002 775 969 745 691 773 765
1996 853 1024 805 1052 1163 999 1184 1053 1176 1197 911 844
1997 1007 1150 1108 1315 1212 1127 1074 692 947 900 853 677
1998 894 1068 945 784 448 813 896 823 894 1129 733 492
1999 514 493 659 1077 778 540 476 612 1351 1708 1720 1595
2000 1720 1519 1177 1163 1963 1615 1572 1696 1491 1669 1490 1298
2001 1826 1537 1915 1467 1735 1905 2027 1976 1439 1592 1636 1424
2002 1941 1845 2010 2199 2373 2133 2445 1492 1864 2133 1663 1520
2003 1832 1846 1755 1757 1863 1701 1926 1689 1646 1832 1545 1445
2004 1636 1652 2188 1826 1836 2606 1843 2143 1784 1712 1786 2148
2005 2122 1960 629 2169 2283 2407 2061 1793 1365 1427 1518 1524
2006 2722 1645 1711 2796 3147 2723 761 2027 2714 2983 2848 2374
2007 2750 926 1728 2766 2974 2691 2435 2592 1868 2320 2112 1948
2008 2305 2255 2712 2789 2025 2368 2607 2584 2554 2434 1984 1921
2009 357 NA NA NA NA NA NA NA NA NA NA NA
One can then review the annual posting frequency via:
pdf("SAS-L.pdf", height = 4, width = 7)
mp <- barplot(rowSums(Posts, na.rm = TRUE),
beside = TRUE,
cex.names = 0.6, main = "SAS-L Traffic",
cex.axis = 0.75, las = 1)
mtext(text = rowSums(Posts, na.rm = TRUE), at = mp, side = 1,
line = 2, cex = 0.5)
dev.off()
There would appear to be marked increases in 2000 and again in 2006.
However, it has been flat for the past 3 calendar years. No decline yet,
but it will happen in due course...
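As a quick numeric check of that (a minimal sketch; it assumes the SAS-L
'Posts' data frame defined above is still in the workspace, and note that
1993 and 2009 are partial years):
## annual SAS-L totals and year-over-year percent changes
annual <- rowSums(Posts, na.rm = TRUE)
round(100 * diff(annual) / head(annual, -1), 1)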
No comparable posting data table exists for S-News as far as I can find,
so I wrote a quick program to read the S-News archive pages here:
http://www.biostat.wustl.edu/archives/html/s-news/
and get monthly posting counts, using the 'Thread' based html pages,
where each monthly embedded post link has a URL of the form:
http://www.biostat.wustl.edu/archives/html/s-news/YYYY-MM/msgXXXXX.html
Thus, the program I used is:
TD <- paste(rep(1998:2009, each = 12), sprintf("%02d", 1:12), sep = "-")
Posts <- numeric(length(TD))
for (i in seq(along = TD))
{
URL <- paste("http://www.biostat.wustl.edu/archives/html/s-news/",
TD[i], "/threads.html", sep = "")
cat(URL, "\n")
if (!inherits(try(con <- readLines(URL)), "try-error"))
{
Posts[i] <- length(grep("msg.*\\.html", con))
rm(con)
} else {
Posts[i] <- NA
}
}
Posts <- matrix(Posts, ncol = 12, byrow = TRUE)
rownames(Posts) <- 1998:2009
colnames(Posts) <- month.abb
That gives you:
Posts <- structure(c(NA, 210, 264, 246, 230, 189, 197, 174, 109, 51, 48,
5, 273, 173, 313, 232, 255, 179, 230, 161, 87, 59, 63, NA, 378,
313, 285, 252, 242, 218, 257, 193, 99, 74, 58, NA, 293, 300,
264, 300, 228, 196, 151, 182, 123, 48, 47, NA, 330, 334, 306,
331, 219, 189, 164, 174, 107, 46, 31, NA, 243, 254, 247, 282,
248, 217, 175, 109, 96, 34, 27, NA, 219, 284, 245, 258, 230,
221, 154, 159, 84, 47, 40, NA, 209, 270, 302, 260, 207, 187,
187, 144, 97, 39, 28, NA, 191, 300, 204, 260, 221, 186, 195,
107, 68, 35, 41, NA, 241, 253, 251, 229, 280, 295, 150, 98, 73,
70, 30, NA, 181, 300, 261, 232, 228, 197, 176, 82, 53, 56, 27,
NA, 141, 194, 176, 194, 177, 142, 176, 84, 20, 41, 36, NA), .Dim = c(12L,
12L), .Dimnames = list(c("1998", "1999", "2000", "2001", "2002",
"2003", "2004", "2005", "2006", "2007", "2008", "2009"), c("Jan",
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")))
> Posts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1998 NA 273 378 293 330 243 219 209 191 241 181 141
1999 210 173 313 300 334 254 284 270 300 253 300 194
2000 264 313 285 264 306 247 245 302 204 251 261 176
2001 246 232 252 300 331 282 258 260 260 229 232 194
2002 230 255 242 228 219 248 230 207 221 280 228 177
2003 189 179 218 196 189 217 221 187 186 295 197 142
2004 197 230 257 151 164 175 154 187 195 150 176 176
2005 174 161 193 182 174 109 159 144 107 98 82 84
2006 109 87 99 123 107 96 84 97 68 73 53 20
2007 51 59 74 48 46 34 47 39 35 70 56 41
2008 48 63 58 47 31 27 40 28 41 30 27 36
2009 5 NA NA NA NA NA NA NA NA NA NA NA
Which can then be graphed by:
pdf("S-News.pdf", height = 4, width = 7)
mp <- barplot(rowSums(Posts, na.rm = TRUE),
beside = TRUE,
cex.names = 0.6, main = "S-News Traffic",
cex.axis = 0.75, las = 1)
mtext(text = rowSums(Posts, na.rm = TRUE), at = mp, side = 1,
line = 2, cex = 0.5)
dev.off()
The consistent decline in posting frequency since 1999 is notable. The
temporal association with the introduction of R is perhaps profound.
As long as I am on the subject, I figured that I would do the same for
R-Help. The downside is that readLines() (really url() ) does not
support https:, so I took a somewhat different approach, using wget:
TD <- paste(rep(1997:2009, each = 12), month.name, sep = "-")
Posts <- numeric(length(TD))
for (i in seq(along = TD))
{
URL <- paste("https://stat.ethz.ch/pipermail/r-help/",
TD[i], "/thread.html", sep = "")
cat(URL, "\n")
CMD <- paste("wget", URL)
system(CMD)
if (file.exists("thread.html"))
{
con <- readLines("thread.html")
Posts[i] <- length(grep("[0-9]+\\.html", con))
rm(con)
unlink("thread.html")
} else {
Posts[i] <- NA
}
}
Posts <- matrix(Posts, ncol = 12, byrow = TRUE)
rownames(Posts) <- 1997:2009
colnames(Posts) <- month.abb
This gives you:
Posts <- structure(c(NA, 135, 226, 205, 558, 884, 1017, 1116, 1746,
2075, 1714, 2490, 462, NA, 79, 145, 355, 583, 697, 1137, 1580, 1724,
1920, 1907, 2583, NA, NA, 114, 195, 377, 651, 880, 1203, 1946,
1703, 2270, 2191, 2740, NA, 92, 101, 189, 377, 470, 965, 1488,
1657, 2057, 1818, 2145, 2487, NA, 36, 90, 161, 504, 552, 1057,
1268, 1561, 1887, 2029, 2210, 2517, NA, 47, 105, 186, 418, 550,
926, 1319, 1714, 2056, 1811, 2307, 2774, NA, 41, 110, 184, 293,
615, 918, 1344, 1618, 1872, 1785, 2138, 3268, NA, 37, 64, 148,
356, 562, 824, 1210, 1493, 1777, 1898, 2241, 2813, NA, 40, 94,
203, 434, 678, 705, 1443, 1534, 1709, 1902, 2028, 2990, NA, 76,
96, 231, 418, 657, 1055, 1567, 1712, 1810, 2328, 2708, 3037,
NA, 61, 184, 318, 433, 825, 1038, 1605, 1895, 1907, 2127, 2594,
2730, NA, 57, 105, 221, 422, 530, 742, 1158, 1481, 1508, 1450,
2028, 2399, NA), .Dim = c(13L, 12L), .Dimnames = list(c("1997",
"1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005",
"2006", "2007", "2008", "2009"), c("Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
> Posts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1997 NA NA NA 92 36 47 41 37 40 76 61 57
1998 135 79 114 101 90 105 110 64 94 96 184 105
1999 226 145 195 189 161 186 184 148 203 231 318 221
2000 205 355 377 377 504 418 293 356 434 418 433 422
2001 558 583 651 470 552 550 615 562 678 657 825 530
2002 884 697 880 965 1057 926 918 824 705 1055 1038 742
2003 1017 1137 1203 1488 1268 1319 1344 1210 1443 1567 1605 1158
2004 1116 1580 1946 1657 1561 1714 1618 1493 1534 1712 1895 1481
2005 1746 1724 1703 2057 1887 2056 1872 1777 1709 1810 1907 1508
2006 2075 1920 2270 1818 2029 1811 1785 1898 1902 2328 2127 1450
2007 1714 1907 2191 2145 2210 2307 2138 2241 2028 2708 2594 2028
2008 2490 2583 2740 2487 2517 2774 3268 2813 2990 3037 2730 2399
2009 462 NA NA NA NA NA NA NA NA NA NA NA
Which again can be graphed as:
pdf("R-Help.pdf", height = 4, width = 7)
mp <- barplot(rowSums(Posts, na.rm = TRUE),
beside = TRUE,
cex.names = 0.6, main = "R-Help Traffic",
cex.axis = 0.75, las = 1)
mtext(text = rowSums(Posts, na.rm = TRUE), at = mp, side = 1,
line = 2, cex = 0.5)
dev.off()
Now....there's a healthy growth curve.... :-)
Note that the annual traffic volume for 2008 on R-Help exceeds that on
SAS-L.
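For the record, that comparison can be computed directly. A minimal sketch,
assuming the two monthly tables above were saved under distinct names (both
are called 'Posts' as posted), say 'sasl' for SAS-L and 'rhelp' for R-Help:
## 2008 totals for the two lists (hypothetical object names)
c(SAS_L  = rowSums(sasl,  na.rm = TRUE)["2008"],
  R_help = rowSums(rhelp, na.rm = TRUE)["2008"])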
For convenience, I am attaching each of the 3 plots.
Regards,
Marc Schwartz
Spencer
Yes, everyone knows that the lack of warranty should be hidden in the
fine print, and say something like this:
"Institute warrants that the media on which SAS/C OnlineDoc is furnished
will be free from defects in material and workmanship under normal use
for a period of ninety (90) days from the date of delivery of SAS/C
OnlineDoc. Licensee’s exclusive remedy for breach of this warranty shall
be replacement of the defective media by the Institute. Institute and
its licensors disclaim all other warranties, express or implied,
including, but not limited to, any implied warranties of merchantability
and/or fitness for a particular purpose whether alleged to arise by law,
by reason of custom or usage in the trade, or by course of dealing. "
(Sorry, I couldn't find SAS/Stat's lack of warranty. I found this one
at
http://support.sas.com/documentation/onlinedoc/sasc/doc700/html/common/agreement.htm)
Duncan Murdoch
And that's an issue that always comes up on Linux v. Microsoft -- just
because you pay money for it doesn't mean you're buying meaningful
guarantees.
--
Due to the recession, requests for instant gratification will be
deferred until arrears in scheduled gratification have been satisfied.
One would hope that if someone were to use software to "build engines
for aircraft", that said person would sufficiently test the software to
have confidence in it, whether it had a "Warranty" or not — at least
that's my mode of operation…
Cheers!
Tom
--
Thomas E Adams
National Weather Service
Ohio River Forecast Center
1901 South State Route 134
Wilmington, OH 45177
EMAIL: thomas...@noaa.gov
VOICE: 937-383-0528
FAX: 937-383-0033
> It is an ignorant comment by a marketing person who has been spoon fed
> her lines...it is also a comment being made from a very defensive and
> insecure posture.
To some extent but we should also realize that open source software is
a nonsensical idea to those in the commercial software business. It
just doesn't fit into their world view.
As part of the 40th anniversary of Technometrics there will be a
discussion article on "The Future of Statistical Computing" by Leland
Wilkinson in the Nov. 2008 issue. (I say "will be" because I don't
see it on the web site yet.) Lee is the creator of Systat and is now
associated with SPSS, Inc. which bought Systat. I am one of the
discussants and I agreed with most of what Lee had to say except with
regard to the role of open source software. Lee looked at the market
share of SAS, SPSS, Stata, S-PLUS, Minitab, etc. in statistical
software and based his projections on that. He had some ball park
figure for the "market share" of R and concluded that it wouldn't
really be important. My response was that this misses the point. R
is a community, not a "product" in the traditional software sense. I
referred to Eric Raymond's essay "The Cathedral and the Bazaar", which
I think is still relevant in contrasting the views of those in the
commercial software and the open source software communities.
> Congrats to R Core and the R Community. This is yet another sign of R's
> growth and maturity.
______________________________________________
At the end we plot the raw data as well as the time
series of totals and show loess smooths for each.
By running the code below we see that:
- the sum of the three seems to be rising at a constant rate
- S is declining
- SAS and R are rising
- R is rising the fastest, though it has completed its phase
  of highest growth, which ended around 2004
tt3 <- structure(c(15, 458, 330, 219, 472, 517, 546, 511, 658, 681,
712, 751, 763, 975, 703, 805, 752, 666, 548, 734, 963, 792, 945,
1002, 775, 969, 745, 691, 773, 765, 853, 1024, 805, 1052, 1163,
999, 1184, 1053, 1176, 1197, 911, 844, 1007, 1150, 1108, 1315,
1212, 1127, 1074, 692, 947, 900, 853, 677, 894, 1068, 945, 784,
448, 813, 896, 823, 894, 1129, 733, 492, 514, 493, 659, 1077,
778, 540, 476, 612, 1351, 1708, 1720, 1595, 1720, 1519, 1177,
1163, 1963, 1615, 1572, 1696, 1491, 1669, 1490, 1298, 1826, 1537,
1915, 1467, 1735, 1905, 2027, 1976, 1439, 1592, 1636, 1424, 1941,
1845, 2010, 2199, 2373, 2133, 2445, 1492, 1864, 2133, 1663, 1520,
1832, 1846, 1755, 1757, 1863, 1701, 1926, 1689, 1646, 1832, 1545,
1445, 1636, 1652, 2188, 1826, 1836, 2606, 1843, 2143, 1784, 1712,
1786, 2148, 2122, 1960, 629, 2169, 2283, 2407, 2061, 1793, 1365,
1427, 1518, 1524, 2722, 1645, 1711, 2796, 3147, 2723, 761, 2027,
2714, 2983, 2848, 2374, 2750, 926, 1728, 2766, 2974, 2691, 2435,
2592, 1868, 2320, 2112, 1948, 2305, 2255, 2712, 2789, 2025, 2368,
2607, 2584, 2554, 2434, 1984, 1921, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
273, 378, 293, 330, 243, 219, 209, 191, 241, 181, 141, 210, 173,
313, 300, 334, 254, 284, 270, 300, 253, 300, 194, 264, 313, 285,
264, 306, 247, 245, 302, 204, 251, 261, 176, 246, 232, 252, 300,
331, 282, 258, 260, 260, 229, 232, 194, 230, 255, 242, 228, 219,
248, 230, 207, 221, 280, 228, 177, 189, 179, 218, 196, 189, 217,
221, 187, 186, 295, 197, 142, 197, 230, 257, 151, 164, 175, 154,
187, 195, 150, 176, 176, 174, 161, 193, 182, 174, 109, 159, 144,
107, 98, 82, 84, 109, 87, 99, 123, 107, 96, 84, 97, 68, 73, 53,
20, 51, 59, 74, 48, 46, 34, 47, 39, 35, 70, 56, 41, 48, 63, 58,
47, 31, 27, 40, 28, 41, 30, 27, 36, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 92, 36, 47, 41, 37, 40, 76, 61, 57, 135,
79, 114, 101, 90, 105, 110, 64, 94, 96, 184, 105, 226, 145, 195,
189, 161, 186, 184, 148, 203, 231, 318, 221, 205, 355, 377, 377,
504, 418, 293, 356, 434, 418, 433, 422, 558, 583, 651, 470, 552,
550, 615, 562, 678, 657, 825, 530, 884, 697, 880, 965, 1057,
926, 918, 824, 705, 1055, 1038, 742, 1017, 1137, 1203, 1488,
1268, 1319, 1344, 1210, 1443, 1567, 1605, 1158, 1116, 1580, 1946,
1657, 1561, 1714, 1618, 1493, 1534, 1712, 1895, 1481, 1746, 1724,
1703, 2057, 1887, 2056, 1872, 1777, 1709, 1810, 1907, 1508, 2075,
1920, 2270, 1818, 2029, 1811, 1785, 1898, 1902, 2328, 2127, 1450,
1714, 1907, 2191, 2145, 2210, 2307, 2138, 2241, 2028, 2708, 2594,
2028, 2490, 2583, 2740, 2487, 2517, 2774, 3268, 2813, 2990, 3037,
2730, 2399), .Dim = c(186L, 3L), .Dimnames = list(NULL, c("SAS",
"S", "R")), .Tsp = c(1993.5, 2008.91666666667, 12), class = c("mts",
"ts"))
tt4 <- cbind(tt3, rowSums(tt3))
colnames(tt4) <- c(colnames(tt3), "Sum")
ts.plot(tt4, col = 1:4)
grid()
legend("topleft", colnames(tt4), lty = 1, col = 1:4)
library(dyn)
for(i in 1:4) lines(fitted(dyn$loess(tt4[, i] ~ time(tt4))), col = i)
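A small follow-up sketch using the 'tt4' series built above: calendar-year
totals per list, which make the S/R crossover and the 2008 R vs. SAS
comparison easy to read off numerically (years with missing months come
out as NA):
## calendar-year totals per list from the monthly 'tt4' series
yr <- floor(time(tt4))
annual <- apply(tt4, 2, function(x) tapply(x, yr, sum))
round(annual)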
## SAS-L traffic ('sas' was truncated in the original post; it is the same
## monthly SAS-L posting data frame shown in full in Marc Schwartz's message
## above, i.e. his SAS-L 'Posts' object)
sas <- structure(list(Jan = c(NA, 546L, 548L, 853L, 1007L, 894L, 514L,
## s-news traffic
s <- structure(c(NA, 210, 264, 246, 230, 189, 197, 174, 109, 51, 48,
5, 273, 173, 313, 232, 255, 179, 230, 161, 87, 59, 63, NA, 378,
313, 285, 252, 242, 218, 257, 193, 99, 74, 58, NA, 293, 300,
264, 300, 228, 196, 151, 182, 123, 48, 47, NA, 330, 334, 306,
331, 219, 189, 164, 174, 107, 46, 31, NA, 243, 254, 247, 282,
248, 217, 175, 109, 96, 34, 27, NA, 219, 284, 245, 258, 230,
221, 154, 159, 84, 47, 40, NA, 209, 270, 302, 260, 207, 187,
187, 144, 97, 39, 28, NA, 191, 300, 204, 260, 221, 186, 195,
107, 68, 35, 41, NA, 241, 253, 251, 229, 280, 295, 150, 98, 73,
70, 30, NA, 181, 300, 261, 232, 228, 197, 176, 82, 53, 56, 27,
NA, 141, 194, 176, 194, 177, 142, 176, 84, 20, 41, 36, NA), .Dim = c(12L,
12L), .Dimnames = list(c("1998", "1999", "2000", "2001", "2002",
"2003", "2004", "2005", "2006", "2007", "2008", "2009"), c("Jan",
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")))
r <- structure(c(NA, 135, 226, 205, 558, 884, 1017, 1116, 1746,
2075, 1714, 2490, 462, NA, 79, 145, 355, 583, 697, 1137, 1580, 1724,
1920, 1907, 2583, NA, NA, 114, 195, 377, 651, 880, 1203, 1946,
1703, 2270, 2191, 2740, NA, 92, 101, 189, 377, 470, 965, 1488,
1657, 2057, 1818, 2145, 2487, NA, 36, 90, 161, 504, 552, 1057,
1268, 1561, 1887, 2029, 2210, 2517, NA, 47, 105, 186, 418, 550,
926, 1319, 1714, 2056, 1811, 2307, 2774, NA, 41, 110, 184, 293,
615, 918, 1344, 1618, 1872, 1785, 2138, 3268, NA, 37, 64, 148,
356, 562, 824, 1210, 1493, 1777, 1898, 2241, 2813, NA, 40, 94,
203, 434, 678, 705, 1443, 1534, 1709, 1902, 2028, 2990, NA, 76,
96, 231, 418, 657, 1055, 1567, 1712, 1810, 2328, 2708, 3037,
NA, 61, 184, 318, 433, 825, 1038, 1605, 1895, 1907, 2127, 2594,
2730, NA, 57, 105, 221, 422, 530, 742, 1158, 1481, 1508, 1450,
2028, 2399, NA), .Dim = c(13L, 12L), .Dimnames = list(c("1997",
"1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005",
"2006", "2007", "2008", "2009"), c("Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
library(reshape)
sas <- melt(as.matrix(sas), na.rm = TRUE)
r <- melt(r, na.rm = TRUE)
s <- melt(s, na.rm = TRUE)
names(r) <- names(s) <- names(sas) <- c("year", "month", "count")
sas$software <- "sas"
s$software <- "s"
r$software <- "r"
all <- rbind(sas, s, r)
all$date <- with(all,
as.Date(paste(year, month, 15, sep = "-"), "%Y-%b-%d"))
library(ggplot2)
qplot(date, count, data = all, geom = "line", colour = software) +
geom_smooth(se = F, size = 1)
last_plot() + scale_y_log10(breaks = 10^(1:3), labels = 10^(1:3))
library(plyr)   ## ddply() comes from plyr; load it explicitly
yearly <- ddply(all, .(year, software), function(df) c(count = sum(df$count)))
qplot(year, count, data = yearly, geom = "line", colour = software)
Hadley
The image is even more striking (and more accurately reflects
reality, I believe) if you add "log='y'" to "ts.plot".
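For anyone who wants to try it, a minimal version of that change to Gabor's
call (assuming 'tt4' from his message is still in the workspace; ts.plot()
passes 'log' through to the underlying plot call):
ts.plot(tt4, col = 1:4, log = "y")   ## same plot, log scale on the y axis
legend("topleft", colnames(tt4), lty = 1, col = 1:4)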
Best Wishes,
Spencer