FYI: the thread started by Jannis and esp. Ripley's recommendation.
Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To post to this group, send email to corplin...@googlegroups.com.
To unsubscribe from this group, send email to corpling-with...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/corpling-with-r?hl=en.
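(For anyone following along: asd below is the big corpus string loaded earlier in the thread. If you want to reproduce the timings, a hypothetical stand-in of comparable size can be generated like this; the exact numbers will of course differ from the real data:)

# hypothetical stand-in for the corpus string from earlier in the thread:
# one long string of random letters and non-word characters
asd <- paste(sample(c(letters, " ", ",", "."), 2e6, replace=TRUE), collapse="")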
# time strsplit with the default regex engine
start.time <- Sys.time()
zxc1 <- unlist(strsplit(asd, "\\W+"))
end.time <- Sys.time()
end.time - start.time # Time difference of 0.2574487 secs

# time strsplit with the PCRE engine (perl=TRUE)
start.time <- Sys.time()
zxc2 <- unlist(strsplit(asd, "\\W+", perl=TRUE))
end.time <- Sys.time()
end.time - start.time # Time difference of 1.291877 mins
So if, as it seems, the current problem only applies to curly-bracket
repetition expressions, then in other contexts the variant without
perl=TRUE is indeed much better. Thanks, John!! (See the sketch right below.)
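(In case you didn't see the original thread: a curly-bracket repetition expression is a pattern like "\\w{2,5}". A minimal sketch for timing such a pattern with both engines, using the asd string from above:)

# time a curly-bracket repetition pattern with both regex engines
system.time(m1 <- gregexpr("\\w{2,5}", asd))            # default engine
system.time(m2 <- gregexpr("\\w{2,5}", asd, perl=TRUE)) # PCRE engine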
HOWEVER ... note that this is just because strsplit is sooooo
slooooowwwww. Look here:
start.time <- Sys.time()
# use gsub to find the "\\W+" matches and replace them with
# something that does not occur in the file ...
asd.2 <- gsub("\\W+", "q1w2e3r4t5y6u7i8o9p0", asd, perl=TRUE)
# ... then use THAT for splitting
zxc3 <- unlist(strsplit(asd.2, "q1w2e3r4t5y6u7i8o9p0"))
end.time <- Sys.time()
end.time - start.time # Time difference of 0.602056 secs, tadaaah
So, John is right: perl=TRUE can slow things down, but apparently
mostly (only?) for strsplit. Thus, with easy workarounds like this
one, we can make sure it doesn't slow us down too much.
As usual, just my $0.02,
However,
- John's point is still valid in that strsplit IS very slow, which
many readers may already have noticed when working through the book
(e.g., when using it to split BNC files).
- my response to him is still valid in that it was at least intended
to make people aware of what I always exemplify in the workshops,
bootcamps, etc.: there is nearly always more than one way to do stuff
(often three, namely searching, replacing, or splitting), and often
one of them is easy to think of and another one is faster. In this
case, the most straightforward way to get the words is *splitting*,
i.e., a full-fledged strsplit, as John did it. But_1 you could also do
the main thing with *replacing*, i.e., use gsub and then only an
elementary strsplit afterwards. But_2 you could also do all of this by
*searching*, i.e., instead of getting rid of what you don't want
("\\W+"), you search for what you want using, say, exact.matches or
... (a base-R sketch follows below)
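(In base R, without exact.matches, the *searching* route could look like this; a minimal sketch, again using the asd string from above:)

# the *searching* approach: extract all matches of what you DO want
# ("\\w+") rather than splitting on what you don't want ("\\W+")
zxc4 <- regmatches(asd, gregexpr("\\w+", asd, perl=TRUE))[[1]]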
This now makes $0.04 ;-).
You could also try strapplyc, which can be downloaded from the gsubfn
development repo (although it's not part of the package yet). It's
like strapply in the gsubfn package but does not support all the
arguments that strapply supports; most importantly, the FUN argument
is hard-coded to be c. The critical portion is written in Tcl, so it
should be reasonably fast. My laptop is not that fast, and it can
split the 275k words of Ulysses in less than 3 seconds:
library(gsubfn)
# download and read in strapplyc
source("http://gsubfn.googlecode.com/svn/trunk/R/strapplyc.R")
# James Joyce, Ulysses
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joycec <- paste(joyce, collapse = " ")
system.time(s <- strapplyc(joycec, "\\w+")[[1]])
length(s) # 275546
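(For comparison, the same result with strapply from the released gsubfn package, of which strapplyc is the stripped-down version, would be:)

# equivalent call using strapply; FUN = c is exactly what strapplyc hard-codes
s2 <- strapply(joycec, "\\w+", c)[[1]]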
- strapplyc: 0.3199644 secs
- strsplit(gsub(...)): 1.02107 secs
So, strapplyc seems to be a better alternative!
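(To reproduce the comparison yourself, a minimal sketch, assuming the joycec string from above; timings will of course vary by machine:)

# approach 1: gsub to a placeholder, then an elementary strsplit
system.time({
    joycec.2 <- gsub("\\W+", "q1w2e3r4t5y6u7i8o9p0", joycec, perl=TRUE)
    w1 <- unlist(strsplit(joycec.2, "q1w2e3r4t5y6u7i8o9p0"))
})
# approach 2: strapplyc
system.time(w2 <- strapplyc(joycec, "\\w+")[[1]])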