Whole word matching


Julia Deak

Sep 22, 2008, 11:13:09 AM
to CorpLing with R
Hi - a very simple question, but I'm hung up on it.

I have a vector (nouns) with a list of words I want to search for.

for (j in nouns){
output<-grep(j, student.lines.2, value=TRUE)}

This works, but it finds parts of words that match as well as whole
words. I need to limit it to whole words.

If I try to make a reg expr with "//W" or something on either side to
limit it to whole words, I can't figure out how to embed the variable
name "j" in the reg expr.

I know this is kind of a general Perl / Unix kind of question, but I
have no experience with that stuff and I'm not sure how to educate
myself that way.
Thanks for any help or workarounds you can provide. I remember there
was a function you wrote called "exact.matches," but that looked a bit
more complicated than what I need here, right?

thanks,
Julia Deak

Austin Frank

Sep 22, 2008, 12:03:30 PM
to corplin...@googlegroups.com
On Mon, Sep 22 2008, Julia Deak wrote:

> Hi - a very simple question, but I'm hung up on it.
>
> I have a vector (nouns) with a list of words I want to search for.
>
> for (j in nouns){
> output<-grep(j, student.lines.2, value=TRUE)}
>
> This works, but it finds parts of words that match as well as whole
> words. I need to limit it to whole words.

You'll want to use perl-style regexes and make use of the word boundary
escape sequence. Something like (untested)

for (j in nouns){
    output<-grep(paste("\b", j, "\b", sep=""),
                 student.lines.2,
                 value=TRUE, perl=TRUE)
}

HTH,
/au


--
Austin Frank
http://aufrank.net
GPG Public Key (D7398C2F): http://aufrank.net/personal.asc

Stefan Th. Gries

Sep 22, 2008, 1:43:50 PM
to corplin...@googlegroups.com
Yes, Austin's version will basically do, but you will need to double
the backslashes to "\\b", because R itself treats a single backslash
in a string as an escape character:

# example data
nouns<-c("car", "truck", "bike")
corpus<-c("I need to take care of my bike and my truck")

# loop and paste
for (j in nouns) {
    # word boundaries on both sides so that only whole words match
    output<-grep(paste("\\b", j, "\\b", sep=""), corpus, perl=TRUE, value=TRUE)
    cat("Search term: ", j, "\t\toutput:\t", output, "\n", sep="")
}
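
The effect of the doubled backslash can be checked directly. In an R string literal, "\b" is read as a single backspace character, while "\\b" gives the two characters \ and b that the regex engine interprets as a word boundary. A minimal sketch, reusing the example sentence from above:

```r
# "\b" inside an R string literal is one backspace character, not a regex escape
nchar("\b")    # 1
nchar("\\b")   # 2: a backslash followed by "b"

corpus <- c("I need to take care of my bike and my truck")

# Without boundaries, "car" is found inside "care"
grepl("car", corpus)                        # TRUE

# With word boundaries on both sides, "care" no longer counts as a hit
grepl("\\bcar\\b", corpus, perl = TRUE)     # FALSE
grepl("\\bbike\\b", corpus, perl = TRUE)    # TRUE
```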

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Julia Deak

Sep 22, 2008, 5:04:48 PM
to CorpLing with R
Thanks very much! For some reason (j in nouns) didn't work in my case,
so it ended up being:

for (i in 1:length(nouns)){
    current.search<-paste("\\b", nouns[i], "\\b", sep="")
    output<-grep(current.search, student.lines.2, value=TRUE)
    print(output[1])
}
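
Note that the loop above overwrites output on every pass and prints only the first hit. If all matching lines for every noun are wanted, one possible variant collects them in a named list. This is only a sketch: the contents of nouns and student.lines.2 below are made-up stand-ins for the real data, and the name matches is chosen just for illustration.

```r
# Hypothetical stand-ins for the real data
nouns <- c("car", "truck", "bike")
student.lines.2 <- c("I take care of my bike",
                     "the truck hit a car",
                     "no match here")

# One pattern per noun, with a word boundary on each side
patterns <- paste0("\\b", nouns, "\\b")

# Keep every matching line for every noun instead of overwriting 'output'
matches <- lapply(patterns, grep, x = student.lines.2, value = TRUE, perl = TRUE)
names(matches) <- nouns

# All lines containing the whole word "car"
matches$car
```

Nouns that themselves contain regex metacharacters (a period or parenthesis, for example) would need escaping before being pasted into the pattern; the sketch assumes plain alphabetic words.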

I appreciate your time,
Julia Deak