A good reason to always use perl=TRUE


Stefan Th. Gries

Dec 10, 2011, 11:45:00 AM
to CorpLing with R
http://markmail.org/search/?q=R+list%3Aorg.r-project.r-help+order%3Adate-backward+unexpected+sub#query:R%20list%3Aorg.r-project.r-help%20order%3Adate-backward%20unexpected%20sub+page:1+mid:mzdqzgm5sigm6l2r+state:results

FYI: the thread started by Jannis and esp. Ripley's recommendation.

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

John Newman

Dec 10, 2011, 3:58:40 PM
to corplin...@googlegroups.com
I can see how using perl=TRUE fixes the problem alluded to in that post that Stefan directed us to. 
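For anyone not following the link: the problem there involved curly-bracket repetition (e.g. {2,}) giving unexpected results from sub/gsub under the default engine, which perl=TRUE avoids by switching to PCRE. A minimal sketch of calling the two engines side by side (the pattern and string are invented for illustration):

x <- "aaa bb c"
gsub("[a-z]{2,}", "X", x)              # default (TRE) engine
gsub("[a-z]{2,}", "X", x, perl=TRUE)   # PCRE engine, as recommended in that thread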

On the other hand, I can think of one situation where you would be wise *not* to use perl=TRUE. It concerns strsplit, but I imagine the issue would be relevant to some other functions as well. Here's a result from scanning in about half of Jane Austen's novel Emma and then running a script which times the process:

> emma.half <- scan("/Users/johnnewman/Desktop/Emma_bits/Emma3.txt", what = "char", sep = "\n")
Read 7509 items
> emma.half.string <- paste(emma.half, collapse = " ")

> start.time <- Sys.time()
> emma.half.words <- unlist(strsplit(emma.half.string, "\\W+"))
> end.time <- Sys.time()
> (end.time - start.time)
Time difference of 0.2605889 secs

> start.time <- Sys.time()
> emma.half.words <- unlist(strsplit(emma.half.string, "\\W+", perl = TRUE))
> end.time <- Sys.time()
> (end.time - start.time)
Time difference of 1.191505 mins

So, without perl=TRUE it took less than a second; with perl=TRUE, more than a minute. You can easily imagine the implications for working with even a small corpus like Brown.
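
Incidentally, wrapping the call in system.time() gives the same measurement more compactly; a quick sketch using the same objects as above:

system.time(emma.half.words <- unlist(strsplit(emma.half.string, "\\W+")))
system.time(emma.half.words <- unlist(strsplit(emma.half.string, "\\W+", perl = TRUE)))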

John






--
John Newman
Professor 
Department of Linguistics, 4-32 Assiniboia Hall, University of Alberta
Edmonton T6G 2E7 CANADA
Fax: (780) 492-0806, Tel: (780) 492-0804
Homepage: http://johnnewm.jimdo.com

Stefan Th. Gries

Dec 10, 2011, 4:22:27 PM
to corplin...@googlegroups.com
Yes, that IS a big difference; I just replicated it here with Brown A:

# asd holds Brown A as one long character string
start.time <- Sys.time()
zxc1 <- unlist(strsplit(asd, "\\W+"))
end.time <- Sys.time()
end.time - start.time # Time difference of 0.2574487 secs

start.time <- Sys.time()
zxc2 <- unlist(strsplit(asd, "\\W+", perl=TRUE))
end.time <- Sys.time()
end.time - start.time # Time difference of 1.291877 mins

So if, as it seems, the original problem only applies to curly-bracket
repetition expressions, then in other contexts the variant without
perl=TRUE is indeed much better. Thanks, John!

HOWEVER ... note that this is just because strsplit is sooooo
slooooowwwww. Look here:

start.time <- Sys.time()
# use gsub to find the "\\W+" matches and replace them with something not in the file
asd.2 <- gsub("\\W+", "q1w2e3r4t5y6u7i8o9p0", asd, perl=TRUE)
# then use THAT for splitting
zxc3 <- unlist(strsplit(asd.2, "q1w2e3r4t5y6u7i8o9p0"))
end.time <- Sys.time()
end.time - start.time # Time difference of 0.602056 secs, tadaaah

So, John is right: perl=TRUE can slow things down, but apparently
mostly (only?) for strsplit. Thus, with easy workarounds, we can make
sure that doesn't slow us down too much.
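
To package the workaround for reuse, something like this might do (a sketch only: the function name is mine, and the placeholder must be a string that cannot occur in the input):

fast.split <- function(x, pattern, placeholder="q1w2e3r4t5y6u7i8o9p0") {
   marked <- gsub(pattern, placeholder, x, perl=TRUE) # fast PCRE replacement
   # fixed=TRUE splits on the literal placeholder, skipping the regex engine entirely
   unlist(strsplit(marked, placeholder, fixed=TRUE))
}
zxc4 <- fast.split(asd, "\\W+") # same result as zxc3 above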

As usual, just my $0.02,

Marco Schilk

Dec 10, 2011, 4:25:00 PM
to corplin...@googlegroups.com
Mhh, yes. Maybe I did not get the original argument well enough, but was it meant to say that perl=T is better in each and every case? If that were true, perl=F should not be an option. And in the case you describe, I cannot think of a reason why I would want perl=T. So yes, perl=T naturally is slower, but for just splitting at non-word characters there is no reason to use it anyway... If, however, I want to use more complicated regexes for, say, splitting a tagged corpus at very specific cutoff points, perl=T may make sense even if it takes longer... Or do I misunderstand the whole thing?
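
For instance, lookaround assertions are PCRE-only, so a split like the following (tag format and pattern invented for illustration) simply fails without perl=T:

tagged <- "The_AT jury_NN said_VBD that_CS"
# split at spaces only where a tagged token follows; the lookahead needs PCRE
strsplit(tagged, " (?=\\w+_)", perl=TRUE)[[1]]
# without perl=TRUE, the default engine rejects "(?=" as an invalid regular expression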
Cheers,
Marco


Dr. Marco Schilk
Akademischer Rat a.Z.
English Language and Linguistics
Justus Liebig University Giessen
Otto-Behaghel Str. 10b
D-35394 Giessen
tel: +49-641-99 30152
email: marco....@anglistik.uni-giessen.de



Stefan Th. Gries

Dec 10, 2011, 4:34:48 PM
to corplin...@googlegroups.com
Well, the exchange on the R-help list was certainly NOT meant to
suggest perl=TRUE is ALWAYS better. My subject line kinda implied
that, though, but, to clarify, that is due more to my laziness/inability
to bear in mind which regexes are PCRE and which are not ;-) So maybe
I should have worded my subject line more carefully, something like
"A good reason to always use perl=TRUE (when you're as lazy as me)".

However,

- John's point is still valid in that strsplit IS very slow, which
many readers may have noticed already while working through the book
(e.g., when using it to split BNC files).
- my response to him is still valid in that it was at least intended
to make people aware of what I always exemplify in the workshops,
bootcamps, etc.: there is nearly always more than one way to do stuff
(often three, namely searching, replacing, or splitting), and often
one of them is easy to think of while another one is faster. In this
case, the most straightforward way to get the words is *splitting*,
i.e., a full-fledged strsplit, as John did it. But_1 you could also do
the main work with *replacing*, i.e., use gsub and then only an
elementary strsplit afterwards. But_2 you could also do all this by
*searching*, i.e., instead of getting rid of what you don't want
("\\W+"), you search for what you want using, say, exact.matches or
the like; see the sketch below.
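
A base-R sketch of that searching route (gregexpr plus regmatches, in base R since 2.14, extract the words directly instead of deleting the non-words):

zxc5 <- regmatches(asd, gregexpr("\\w+", asd, perl=TRUE))[[1]]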

This now makes $0.04 ;-).

Marco Schilk

Dec 10, 2011, 4:44:12 PM
to corplin...@googlegroups.com
Thanks for what almost amounts to a counterfeiter's dollar... The last $0.02 confirm what I kept experiencing in my small-scale work on ICE corpora: sometimes the first idea is not the best. When I think of the question I posted a couple of weeks ago, I had to find out the hard way that what seemed like a good idea at the time is not possible with 2011/12 memory. Rethinking things and searching for the non-obvious solution helped (although the obvious one would have done more, I think ;))
adding small change,
Marco

ggrothendieck

Dec 11, 2011, 11:26:22 PM
to CorpLing with R

You could also try strapplyc, which can be downloaded from the gsubfn
development repo (although it's not part of the package yet). It's like
strapply in the gsubfn package but does not support all the arguments
that strapply supports; most importantly, the FUN argument is
hard-coded to be c. The critical portion is written in Tcl, so it
should be reasonably fast. My laptop is not that fast, and it can split
the 275k words of Ulysses in less than 3 seconds:

library(gsubfn)
# download and read in strapplyc
source("http://gsubfn.googlecode.com/svn/trunk/R/strapplyc.R")

# James Joyce, Ulysses
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joycec <- paste(joyce, collapse = " ")

system.time(s <- strapplyc(joycec, "\\w+")[[1]])
length(s) # 275546

Stefan Th. Gries

Dec 11, 2011, 11:54:02 PM
to corplin...@googlegroups.com
Thanks, Gabor, for pointing this out. I just compared your strapplyc
against the strsplit(gsub(...)) approach using the Brown corpus (as
before, but on a different computer). The result:

- strapplyc: 0.3199644 secs
- strsplit(gsub(...)): 1.02107 secs
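
The timings were presumably taken along these lines (a sketch, with asd again holding the Brown corpus as one string):

system.time(zxc.a <- strapplyc(asd, "\\w+")[[1]])  # ~0.32 secs here
system.time({                                      # ~1.02 secs here
   asd.2 <- gsub("\\W+", "q1w2e3r4t5y6u7i8o9p0", asd, perl=TRUE)
   zxc.b <- unlist(strsplit(asd.2, "q1w2e3r4t5y6u7i8o9p0"))
})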

So, strapplyc seems to be a better alternative!
