Retrieving contexts


eloiz

Sep 25, 2011, 4:41:01 PM
to CorpLing with R
Hi,

Apologies in advance for what is probably a very basic question about
creating concordances and retrieving preceding/subsequent contexts.

Qs: in reference to steps 4f and 7 of the script below, I can see that,
to avoid the error message in 7 (Error in i - 100 : non-numeric
argument to binary operator), the grep line in 4f needs 'value=F'.
But then how do I get back to the actual matches (rather than their
locations) in the output?

Thanks a lot!!

Eloiz

#############################################
# (1) create an output file
output.file<-choose.files()

# (2) choose relevant corpus files
corpus.files<-choose.files()

# (3) define a vector all.matches.form for the matches
all.matches.form<-character()

# (4) write a for-loop that
for (i in 1:length(corpus.files)) {

# (a) reads each corpus file into a vector current.corpus.file
current.corpus.file<-scan(corpus.files[i], what="char", sep="\n",
quiet=TRUE)

# (b) output.files a 'progress report'
cat(i, "\n")

# (c) retrieves only learners' turns (across the three speaking tasks) and ...
right.lines<-grep("<B>", current.corpus.file, ignore.case=T,
perl=TRUE, value=TRUE)

# (d) removes all <B> tags and ...
right.lines2<-gsub("<B>", "", right.lines, perl=T)

# (e) removes all </B> tags
right.lines3<-gsub("</B>", "", right.lines2, perl=T)

# (f) retrieves can forms
current.matches<-grep("\\bXXX\\b", right.lines3, ignore.case=T,
value=TRUE) # ???

# (g) appends the matches to the vector all.matches.form (and adds the filename to the matches)
current.matches2<-paste(corpus.files[i], current.matches, sep="\t")
all.matches.form<-append(all.matches.form, current.matches2)

} # end of for: go through all files

# (5) customize the output.file: tab stops around matches
all.matches.form2<-gsub("\\bXXX\\b", "\t\\1\t", all.matches.form,
perl=TRUE) # insert a tab stop before and after the matches

# (6) print the output.file and a column header for the preceding/subsequent context of each match
cat("FILENAME\tPRECEDING_CONTEXT\tMATCH\tSUBSEQUENT_CONTEXT",
all.matches.form2, file=output.file, sep="\n")

# (7) write a loop that
for (i in all.matches.form2) {

# (a) retrieves preceding/subsequent contexts of each match
cat(file=output.file, append=T,
all.matches.form2[max(0, i-100):max(0, i-1)],
"\t",
all.matches.form2[i],
"\t",
all.matches.form2[(i+1):min(i+100,
length(all.matches.form2)+1)],
"\n")
} # end of for ##### Error in i - 100 : non-numeric argument to binary operator


# (8) output the results
cat(all.matches.form2, file=output.file, sep="\n")

#############################################

Stefan Th. Gries

Sep 25, 2011, 7:14:26 PM
to corplin...@googlegroups.com
As far as I can see,

- 4f needs value=TRUE
- 5 needs this search expression: "\\b(XXX)\\b"
- 7 needs to be fixed because

- on the one hand, you're using a loop of this type

for (i in all.matches.form2) {

which means, i will take on all elements of

all.matches.form2

but on the other hand you're using i-100, as if you had used a
loop of this type:

for (i in seq(all.matches.form2)) {
# or
for (i in 1:length(all.matches.form2)) {

which would mean, i will take on numbers from 1 to ...
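
The second and third points can be sketched as follows. This is only a minimal illustration: the vector `lines` and the search word "can" are made-up stand-ins, not taken from the original script.

```r
# Hypothetical stand-in for all.matches.form (not from the original script)
lines <- c("a.txt\tI can go", "b.txt\tyou can stay", "c.txt\twe can try")

# Point 5: "\\1" in the replacement only works if the pattern
# contains a capture group, i.e. parentheses around the search word
tabbed <- gsub("\\b(can)\\b", "\t\\1\t", lines, perl = TRUE)

# Point 7: looping over the vector itself makes i a character string,
# so i - 100 would fail with "non-numeric argument to binary operator"
for (i in lines) {
  stopifnot(is.character(i))
}

# Looping over positions makes i an integer, so arithmetic such as
# i - 1 is valid and tabbed[i] retrieves the actual element
for (i in seq_along(tabbed)) {
  stopifnot(is.numeric(i))
  cat(i, "\t", tabbed[i], "\n", sep = "")
}
```

`seq_along(x)` behaves like `seq(x)` for a vector of this kind, but it also returns an empty sequence when `x` is empty, which `1:length(x)` does not.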

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

John Newman

Sep 25, 2011, 8:31:40 PM
to corplin...@googlegroups.com
Eloiz

When it comes to creating concordances using R, of course you can write your own script. But I find Stefan Gries' exact.matches script (which is part of the arsenal of scripts accompanying his Quantitative Corpus Linguistics with R book) the most practical. Here's an example of running that script on Jane Austen's Emma:

> text = scan("/Users/johnnewman/Desktop/Emma_bits/Emma.txt", what = "char", sep = "\n")

> results <- exact.matches("(can)|(could)", text)

> (results$`exact matches`)[1:10]
 [1] "could" "could" "could" "could" "could" "could" "could" "could" "could" "can"  

> (results$`locations of matches`)[1:10]
 [1]  66  67  73  80  92  93  99 103 107 111

> results$`lines with delimited matches`[1:10] 
 [1] "Line: 66\thers--one to whom she \tcould\t speak every thought as it arose, and who had"
 [2] "Line: 67\tsuch an affection for her as \tcould\t never find fault."                    
 [3] "Line: 73\tdearly loved her father, but he was no companion for her. He \tcould\t not"  
 [4] "Line: 80\ttemper, his talents \tcould\t not have recommended him at any time."         
 [5] "Line: 92\tnot one among them who \tcould\t be accepted in lieu of Miss Taylor for even"
 [6] "Line: 93\thalf a day. It was a melancholy change; and Emma \tcould\t not but sigh over"
 [7] "Line: 99\treconciled to his own daughter's marrying, nor \tcould\t ever speak of her"  
 [8] "Line: 103\tother people \tcould\t feel differently from himself, he was very much"     
 [9] "Line: 107\tas she \tcould\t, to keep him from such thoughts; but when tea came, it was"
[10] "Line: 111\t\"I \tcan\tnot agree with you, papa; you know I cannot. Mr. Weston is such" 

> cat((results$`lines with delimited matches`[1:10]), sep = "\n") #for another view in the console
Line: 66 hers--one to whom she could speak every thought as it arose, and who had
Line: 67 such an affection for her as could never find fault.
Line: 73 dearly loved her father, but he was no companion for her. He could not
Line: 80 temper, his talents could not have recommended him at any time.
Line: 92 not one among them who could be accepted in lieu of Miss Taylor for even
Line: 93 half a day. It was a melancholy change; and Emma could not but sigh over
Line: 99 reconciled to his own daughter's marrying, nor could ever speak of her
Line: 103 other people could feel differently from himself, he was very much
Line: 107 as she could , to keep him from such thoughts; but when tea came, it was
Line: 111 "I can not agree with you, papa; you know I cannot. Mr. Weston is such


So, I'd suggest relying on exact.matches working on a vector of lines for your concordance.

John



--
John Newman
Professor and Chair
Department of Linguistics, 4-32 Assiniboia Hall, University of Alberta
Edmonton T6G 2E7 CANADA
Fax: (780) 492-0806, Tel: (780) 492-5500
Homepage: http://johnnewm.jimdo.com

Stefan Th. Gries

Sep 25, 2011, 8:37:24 PM
to corplin...@googlegroups.com
One thing I should do when I get around to it is improve exact.matches so
that it can return more context, e.g., by adding a user-specified
number of vector elements before and after the one with the match.
It's actually not that difficult, I just never have the time to do it
:-|. But John's right: if you don't need more context than that,
exact.matches would be an easier way!
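
The idea of returning a user-specified number of vector elements around a match could look roughly like this. It is a sketch only: `context.around` is a hypothetical helper that is not part of exact.matches, and `corpus` is a made-up example vector.

```r
# Hypothetical helper, not part of exact.matches: return up to `around`
# elements before and after position i of the character vector x
context.around <- function(x, i, around = 2) {
  before <- if (i > 1) x[max(1, i - around):(i - 1)] else character()
  after <- if (i < length(x)) x[(i + 1):min(length(x), i + around)] else character()
  list(preceding = before, match = x[i], subsequent = after)
}

corpus <- c("one", "two", "three can", "four", "five", "six")

# value=FALSE (the default) returns positions; indexing the vector with
# those positions gets back to the matching elements themselves
hits <- grep("\\bcan\\b", corpus, perl = TRUE)
result <- context.around(corpus, hits[1], around = 2)
```

The explicit checks on `i` avoid the descending sequences that `0:1` or `(n+1):n` would otherwise produce at the edges of the vector.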

eloiz

Sep 27, 2011, 7:01:04 PM
to CorpLing with R
Stefan and John,

Thanks a lot for your input!

Unfortunately, exact.matches is a little limited in my case since I
need about 150 words of preceding and subsequent contexts and I also
need filenames.

Thanks again!
Eloiz

Stefan Th. Gries

Sep 27, 2011, 7:06:47 PM
to corplin...@googlegroups.com
> Unfortunately, exact.matches is a little limited in my case since I need about 150 words of preceding and subsequent contexts
Yes, I thought so, hence my last comment.

> and I also need filenames.

That is not an argument against exact.matches because you can just
paste those together with (tweaked) output from exact.matches()[[4]],
I do that all the time.
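
A minimal sketch of that pasting step, under the assumption that `hits` stands in for the fourth element of the list returned by exact.matches() for one corpus file (the filename and the two lines are invented for illustration):

```r
# Hypothetical stand-ins: in practice `hits` would come from
# exact.matches(...)[[4]] for the file named in `current.file`
current.file <- "learner_01.txt"
hits <- c("I \tcan\t go", "you \tcan\t stay")

# paste() recycles the single filename across all concordance lines,
# giving FILENAME, PRECEDING_CONTEXT, MATCH, SUBSEQUENT_CONTEXT columns
hits.with.file <- paste(current.file, hits, sep = "\t")
```

Inside a loop over several files, it is probably worth checking that a file produced any hits before pasting.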

eloiz

Sep 28, 2011, 3:33:49 PM
to CorpLing with R
Apologies for this additional question on the same topic:

Would anyone have any tips on how, when a single concordance line
shows for instance three matches, to repeat the second and third
matches in the 'matches' column so as to enable their annotation?

I naively thought that the script below would do it but I was wrong.


for (i in seq(all.matches)) {

cat(file=output.file, append=T,
all.matches[max(0, i-100):max(0, i-1)],
"\t",
all.matches[i],
"\t",
all.matches[(i+1):min(i+100, length(all.matches)+1)],
"\n")
}


Thanks a lot!

Eloiz

Stefan Th. Gries

Sep 28, 2011, 3:41:17 PM
to corplin...@googlegroups.com
Again, exact.matches comes to the rescue:

x <- c("This is stupid example where, in an example, the word example shows up three times.")
exact.matches("\\bexample\\b", x)[[4]]
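
For readers without the function at hand, the underlying idea (one output row per match, with the line repeated as often as it has matches) can be sketched in base R. This is an illustration of the technique under that assumption, not the actual exact.matches code:

```r
x <- "This is stupid example where, in an example, the word example shows up three times."
pattern <- "\\bexample\\b"

# Start positions and lengths of all matches in the (single) line
m <- gregexpr(pattern, x, perl = TRUE)[[1]]
starts <- as.integer(m)
stops <- starts + attr(m, "match.length") - 1

# substr() returns one result per element of its first argument,
# so the line is repeated once per match before slicing
lines <- rep(x, length(starts))

# One concordance row per match: preceding context, match, subsequent context
rows <- paste(substr(lines, 1, starts - 1),
              substr(lines, starts, stops),
              substr(lines, stops + 1, nchar(lines)),
              sep = "\t")
```

Each element of `rows` then has a different occurrence of the word in its middle column, so every match can be annotated separately.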

Cheers,
STG

Earl Brown

Oct 14, 2011, 1:49:34 AM
to CorpLing with R
Eloiz, one simplistic way to get more surrounding context is to
collapse the corpus file into a single vector before running
exact.matches:

corpus.file<-scan("/Users/brow2662/Desktop/Chiara08_utf8.txt",
what="char", sep="\n")

corpus.file<-paste(corpus.file, collapse=" ")

hits<-exact.matches(search.expression, corpus.file)[[4]]

This would give you an overload of surrounding context, as it would
give you the entire corpus file as the surrounding context for each
match. You could eliminate the excess in a spreadsheet with "right"
and "left" functions. You could also eliminate the excess in R by
adding an argument to exact.matches that delimits the amount of
surrounding context to be pulled around each match.
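
Trimming the excess in R after the fact might look like the sketch below. `trim.context` is a hypothetical helper, not an argument of exact.matches, and it assumes that words are separated by spaces and that each line has the form preceding, tab, match, tab, subsequent.

```r
# Hypothetical helper, not part of exact.matches: keep at most n words
# on either side of the match in a "preceding\tmatch\tsubsequent" line
trim.context <- function(line, n = 150) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  # strsplit() drops a trailing empty field, so restore it if the
  # match was the last thing on the line
  if (length(parts) < 3) parts[3] <- ""
  before <- strsplit(trimws(parts[1]), " +")[[1]]
  after <- strsplit(trimws(parts[3]), " +")[[1]]
  paste(paste(tail(before, n), collapse = " "),
        parts[2],
        paste(head(after, n), collapse = " "),
        sep = "\t")
}

hit <- "a b c d e \tcan\t f g h i j"
trimmed <- trim.context(hit, n = 2)

# For a whole vector of hits:
# trimmed.hits <- sapply(hits, trim.context, n = 150, USE.NAMES = FALSE)
```

Splitting on spaces is a deliberate simplification; other tokenization decisions (hyphens, apostrophes, markup) would need a different split pattern.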

Best of luck, Earl Brown

eloiz

Oct 15, 2011, 4:25:57 AM
to CorpLing with R
Thanks a lot for that, Earl!

I have just tried your suggestion and it is working perfectly!

Thanks!

Eloiz

Earl Brown

Apr 25, 2013, 5:49:38 PM
to corplin...@googlegroups.com
Hello Corpus Linguistics Rists,

I recently modified Stefan's (terrific) function exact.matches to add some additional functionality. I've added three arguments: L1, R1, and sur.wds. L1 = TRUE puts the contextualized matches returned in the fourth element of the returned list in order by one collocate to the left. R1 = TRUE does the same for one collocate to the right. When both are true it first orders them by L1 and then by R1.

When sur.wds = TRUE, L1 and R1 are put into separate columns to the right of the contextualized matches.

You need to have the packages "gsubfn", "data.table", and "gdata" downloaded from CRAN before running my modified version of exact.matches, which I renamed exactMatches to differentiate it. Here's the whole function:


exactMatches <- function (search.expression, corpus.vector, pcre = TRUE, case.sens = TRUE, characters.around = 0, lines.around = 0, clean.up.spaces = TRUE, L1 = FALSE, R1 = FALSE, sur.wds = FALSE) {
   # Thanks to Earl Brown for feedback on an earlier version
   if (characters.around!=0 & lines.around!=0) {
      stop("At least one of 'characters.around' and 'lines.around' has to be zero ...")
   }
   line.numbers.with.matches <- grep(search.expression, corpus.vector, perl=pcre, value=FALSE, ignore.case=!case.sens) # the numbers of lines that contain matches
   if (any(line.numbers.with.matches)) { # if there are matches ...
      if (characters.around!=0) {
         lines.with.matches <- gsub("( ?_qW1aS3zX5eR7dF9cV_|_qW1aS3zX5eR7dF9cV_ ?)",
                                    " ",
                                    paste(corpus.vector, collapse = "_qW1aS3zX5eR7dF9cV_"),
                                    perl=TRUE)
      } else {
         lines.with.matches <- corpus.vector[line.numbers.with.matches] # the lines that contain matches
      }
      matches <- gregexpr(search.expression, lines.with.matches, perl = pcre, ignore.case = !case.sens) # the start positions and lengths of matches
      number.of.matches <- sapply(matches, length) # the number of matches per line (of the lines that have matches)
      lines <- rep(lines.with.matches, number.of.matches) # the lines with matches, each as many times as it has matches
      line.numbers.with.matches <- rep(line.numbers.with.matches, number.of.matches)
      starts <- unlist(matches) # starting positions of matches
      stops <- starts + unlist(sapply(matches, attr, "match.length")) - 1 # end positions of matches
      exact.string.matches <- substr(lines, starts, stops) # the exact matches
      lines.with.delimited.matches <- paste( # the lines with the tab-delimited matches
         substr(lines, if (characters.around!=0) starts-characters.around else 1, starts-1), "\t", # preceding contexts
         exact.string.matches, "\t", # matches
         substr(lines, stops+1, if (characters.around!=0) stops+characters.around else nchar(lines)), # subsequent contexts
         sep="")
     
     
        ####################
        # Earl's attempt to order the contextualized results by collocates
        # and to put the surrounding words in their own columns

        library("gsubfn")
        library("data.table")
        library("gdata")   

        # order by one word to the right     
        if (R1) {
            rogue.tabs <- grep("\t", lines.with.matches)
            if (any(rogue.tabs)) stop("You have some rogue tabs in your input. You need to delete them before proceeding as they will jack up the output.")
            temp <- strsplit(lines.with.delimited.matches, "\t")
            fol.wd <- sapply(temp, function(x) strapplyc(trim(x[3]), "^[ \\.,?!:;\"\']*(\\w+)", backref = 1))
            dd <- data.table(x = 1:length(fol.wd), y = format(fol.wd))
            new.order <- dd[order(y)]$x
            lines.with.delimited.matches <- lines.with.delimited.matches[new.order]
            line.numbers.with.matches <- line.numbers.with.matches[new.order]
            starts <- starts[new.order]                   
        }
       
        # order by one word to the left
        if (L1) {
            rogue.tabs <- grep("\t", lines.with.matches)
            if (any(rogue.tabs)) stop("You have some rogue tabs in your input. You need to delete them before proceeding as they will jack up the output.")
            temp <- strsplit(lines.with.delimited.matches, "\t")
            prev.wd <- sapply(temp, function(x) strapplyc(trim(x[1]), "(\\w+).?$", backref = 1))
            dd <- data.table(x = 1:length(prev.wd), y = format(prev.wd))
            new.order <- dd[order(y)]$x
            lines.with.delimited.matches <- lines.with.delimited.matches[new.order]
            line.numbers.with.matches <- line.numbers.with.matches[new.order]
            starts <- starts[new.order]         
        }
       
        # puts surrounding words in their own columns
        if (sur.wds) {
            rogue.tabs <- grep("\t", lines.with.matches)
            if (any(rogue.tabs)) stop("You have some rogue tabs in your input. You need to delete them before proceeding as they will jack up the output.")
            temp <- strsplit(lines.with.delimited.matches, "\t")
            prev.wd <- sapply(temp, function(x) strapplyc(trim(x[1]), "(\\w+).?$", backref = 1))
            prev.wd <- sub("character(0)", "NA", format(prev.wd))
            fol.wd <- sapply(temp, function(x) strapplyc(trim(x[3]), "^[ \\.,?!:;\"\']*(\\w+)", backref = 1))
            fol.wd <- sub("character(0)", "NA", format(fol.wd))
            lines.with.delimited.matches <- paste(lines.with.delimited.matches, prev.wd, fol.wd, sep = "\t")
        }

        # end Earl's attempt
        ####################
     
     
      if (lines.around!=0) {
         corpus.vector <- append(corpus.vector, rep("", lines.around))
         starts.of.previous.lines <- pmax(0, line.numbers.with.matches - lines.around)
         ends.of.subsequent.lines <- pmin(line.numbers.with.matches + lines.around, length(corpus.vector))
         for (current.line.with.delimited.match in seq(lines.with.delimited.matches)) {
            lines.with.delimited.matches[current.line.with.delimited.match] <- paste(
               paste(corpus.vector[starts.of.previous.lines[current.line.with.delimited.match]:(line.numbers.with.matches[current.line.with.delimited.match]-1)], collapse=" "),
               lines.with.delimited.matches[current.line.with.delimited.match],
               paste(corpus.vector[(line.numbers.with.matches[current.line.with.delimited.match]+1):ends.of.subsequent.lines[current.line.with.delimited.match]], collapse=" "),
               sep=" ")
         }
      }
     
      # cleaning output as necessary/requested by user
      if (clean.up.spaces) { # clean up spaces around tabs
         lines.with.delimited.matches <- gsub(" *\t *", "\t", lines.with.delimited.matches, perl=TRUE)
      }
      lines.with.delimited.matches <- gsub("(^ {1,}| {1,}$)", "", lines.with.delimited.matches, perl=TRUE) # clean up leading and trailing spaces
      output.list <- list(exact.string.matches,
                          if (characters.around!=0) starts else line.numbers.with.matches, # starting character positions or the numbers of lines with matches, each as many times as it has matches
                          length(unique(line.numbers.with.matches))/sum(nzchar(corpus.vector)),
                          lines.with.delimited.matches,
                          c(Pattern = search.expression, "Corpus (1st 100 char.)"=substr(paste(corpus.vector, collapse=" "), 1, 100), PCRE=pcre, "Case-sensitive"=case.sens),
                          "1.2 (17 August 2012)")
      names(output.list) <- c("Exact matches",
                              paste("Locations of matches (", ifelse (characters.around!=0, "characters", "lines"), ")", sep=""),
                              "Proportion of non-empty corpus parts with matches",
                              "Lines with delimited matches",
                              "Search parameters",
                              "Version (date)")
      return(output.list)
   }
}

Stefan Th. Gries

Apr 26, 2013, 11:42:52 AM
to corplin...@googlegroups.com
I like the general idea of making exact.matches richer and thanks for
sharing this. One thing I am concerned with (and this is one reason
why my functions often don't provide more of this type of convenience)
is that the way this new version is written now slowly begins to, I
think, sacrifice precision and
forcing-the-user-to-make-their-own-decisions for the benefit of
convenience. In this case, for instance, I am 'worried' about how the
use of \\w for finding L1 and R1

- 'relieves' the user of the need to think about what they are doing
(in a way not unlike that of some ready-made software);
- will of course totally not work with many encodings/characters;
- jeopardizes part of 'my mission', which is to make people (want to) realize
what they are doing.

Don't get me wrong, I know that, unlike, say, MonoConc, any user can
tweak the code because this is open source - I just fear that users
will go "oh cool, now eM even does this without me having to worry
about 'anything'." I would therefore prefer an implementation where
the user is 'forced' to enter the characters that are possible
delimiters between the node and L1/R1 when he calls the function
rather than have a script decide for the user what to do with "'",
"-", ..., and this is why my scripts usually (fingers crossed I am not
missing something here) avoid making such decisions.

One other slight quibble: I myself prefer to not use other packages.
There are some cases where I haven't managed that myself, like my
cluster analysis scripts, but in this case I think nearly everything
the functions from gdata etc. do should be doable without those (or by
including that code in eM, with the proper citation of course).
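
One way the two suggestions above (user-specified delimiters, no extra packages) might be combined is sketched here in base R. `neighbours` is a hypothetical helper, not part of exact.matches or exactMatches, and the delimiter pattern in the call is just one example of a choice the caller has to make.

```r
# Hypothetical helper, not part of exact.matches or exactMatches: the
# caller must supply the delimiter pattern (no default, on purpose)
neighbours <- function(lines, delimiters) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  get.words <- function(context) {
    words <- strsplit(context, delimiters, perl = TRUE)[[1]]
    words[nzchar(words)]   # drop empty strings left by leading delimiters
  }
  L1 <- sapply(parts, function(p) {
    w <- get.words(p[1])
    if (length(w)) w[length(w)] else NA_character_
  })
  R1 <- sapply(parts, function(p) {
    w <- if (length(p) >= 3) get.words(p[3]) else character()
    if (length(w)) w[1] else NA_character_
  })
  data.frame(L1 = L1, R1 = R1, stringsAsFactors = FALSE)
}

hits <- c("she \tcould\t speak", "talents \tcould\t not", "\"I \tcan\tnot agree")
nb <- neighbours(hits, delimiters = "[ .,?!:;\"]+")

# Order the concordance lines by L1, then by R1
hits.sorted <- hits[order(nb$L1, nb$R1)]
```

Because the delimiter pattern is an explicit argument, the decision about what to do with characters such as "'" or "-" stays with the user.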

Still, while I begin to differ in terms of 'policy', this is a nice
piece of code. At some point, I may make available my own current
version of exact.matches, which handles repeated overlapping matches,
but I am still testing it (every now and then, that's why it takes so
long).

Just my $0.02,

Earl Brown

Apr 28, 2013, 11:58:16 PM
to corplin...@googlegroups.com
Fair enough.

Yeah, I realized, even in the moment, that "\\w+" wouldn't work for everyone, but figured users could change it as needed. I posted it for users to see my approach to ordering the contextualized results by collocates. I hope, like you, that the users who actually end up using the code will take a minute to look at it and figure out what it's doing, and tweak it as needed. That's how I've learned about scripting with R: by tweaking and combining and recombining parts of previous code to get exactly what I need.

Thanks again for the great function.

Stefan Th. Gries

Apr 29, 2013, 9:20:58 AM
to corplin...@googlegroups.com
Yes, that would be the best outcome indeed, :-) and thanks again for
sharing this!