Dispersion of words across corpus files

247 views
Skip to first unread message

Earl Brown

unread,
Aug 8, 2012, 3:07:08 PM8/8/12
to corplin...@googlegroups.com
Hello Rists. I'm trying to make a frequency list with a column that gives the number of corpus files that each word in the freq list is attested in. For example, I have 11,000 corpus files. If the word "tree" is found in 5,500 of the corpus files, I want the script to tell me that the dispersion of "tree" is 50%, meaning that it occurs in 50% of the corpus files (5,500 / 11,000 = 0.5). I'm having a hard time conceptualizing how to even approach this, let alone which functions I'll need. Any ideas? Thanks in advance. Earl Brown

Stefan Th. Gries

unread,
Aug 8, 2012, 6:35:25 PM8/8/12
to corplin...@googlegroups.com
Before you loop over your n corpus files, you generate a collector

keep.track <- rep(0, n)

And then, within the loop i from 1 to n that loads the file and
searches for stuff, whenever stuff is found (i.e., length(grep(stuff,
corpus.file))>0), you set

keep.track[i] <- 1

After the loop, the dispersion is sum(keep.track) / n. (This is
actually very similar to the pragmatics 1 assignment on the companion
website ;-))

Of course,, you should really use better dispersion statistics than
just this one (range):

- <http://www.linguistics.ucsb.edu/faculty/stgries/research/2008_STG_Dispersion_IJCL.pdf>
and <http://www.linguistics.ucsb.edu/faculty/stgries/research/2012_JL_STG_DispersionCorrection_IJCL.pdf>
- <http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf>
- plus, note the web resources that come with these papers (cf. my
research page).

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
> --
> You received this message because you are subscribed to the Google Groups
> "CorpLing with R" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/corpling-with-r/-/MXNUIzvs6L8J.
> To post to this group, send email to corplin...@googlegroups.com.
> To unsubscribe from this group, send email to
> corpling-with...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/corpling-with-r?hl=en.

Earl Brown

unread,
Aug 9, 2012, 2:27:39 PM8/9/12
to corplin...@googlegroups.com
Thanks for the help, and especially the dispersion1() function on your website. Earl Brown

Earl Brown

unread,
Aug 12, 2012, 2:51:17 PM8/12/12
to corplin...@googlegroups.com
Problem with dispersions1():

I've gotten dispersions1() to work on a small subset of my corpus (ten files, 18k words), but when I run the whole corpus (18 million words in 11 thousand files), dispersions1() never finishes with the first word, even after hours (as many as ten hours). My computer has 4G of memory. Here's my sessionInfo():
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

Any advice on how to get dispersions1() to work with 11K files and 18 million words? Can I break up the corpus into smaller sub-corpora and run dispersions1() and then combine the DPs of the various sub-corpora together at the end? I assume not, but I can't figure out why either dispersions1() or my computer's memory cannot get through the first word. If it's helpful, the "corpus" argument has an exact length of 18,459,067 and the "corpus.part.sizes" argument has an exact length of 11,650.

Thanks. Earl Brown

Stefan Th. Gries

unread,
Aug 12, 2012, 3:38:32 PM8/12/12
to corplin...@googlegroups.com
Hm, I dunno, maybe the RAM is too little? As you may have seen on the
website, I did get it to run once with the 10m words in the spoken
BNC. I don't know anymore how long that took - was a while tho - but
definitely NOT 10hrs+ for one word! (And I did that on a low-tech ACER
with only 3GB RAM. Don't know ...

Earl Brown

unread,
Aug 15, 2012, 10:57:20 PM8/15/12
to corplin...@googlegroups.com
I narrowed it down to the "for" loops in the function. With 11000+ corpus files the function had to iterate over each one for every one of the several thousand words, bigrams, and trigrams I wanted to get dispersion info on. As I was only interested in getting range and DP I modified the function to include only the code that was necessary to get those two dispersion measurements. This excluded the "for" loops. It's quicker now, but the tapply() in the following line also takes a while to complete:

    v <- tapply(corpus==element, corpus.parts, sum) # v

but at least it progresses. It will take several days for the several thousand single words, even more bigrams, and even more trigrams in the 11000+ files that I want to get dispersion info on, but it looks like it will finish (eventually). I also had to save as much as possible in temporary files on the hard drive rather than using collector vectors in the RAM memory. That's probably what is actually taking so long. Anyway, thanks again for the dispersion1() function. Earl Brown

Stefan Th. Gries

unread,
Aug 16, 2012, 12:13:37 AM8/16/12
to corplin...@googlegroups.com
Let me also ask: did you have two loops, one over words, one over
files? If yes, were they nested like this

for (i in files) {
for (j in ngrams) {
get dispersions
}
}

or like this

for (i in ngrams) {
for (j in files) {
get dispersions
}
}

?

Earl Brown

unread,
Aug 17, 2012, 11:44:25 AM8/17/12
to corplin...@googlegroups.com
I took out the following "for" loops (and a lot more of the code) from the function, after copying the function from:
http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/_dispersion1.r

 if(length(names(v)[v>1])==0) {
      numerator <- 0
   } else {
      distances.nearest.neighbors <- vector(length=sum(v[v>1])) # define a vector in which the minimal distances will be collected
      position <- 0 # set a position at which to store the current minimal distance

      for (i in names(v)[v>1]) { # loop over all corpus parts
         current.corpus.part <- corpus[corpus.parts==as.numeric(i)] # look only at the i-th corpus part
         current.element.positions <- which(current.corpus.part==element) # determine the locations of the element in this part

         # create a vector that contains all positions of the element in question and that is
         # prefixed by a number whose diff with the first position gives the length of the corpus part
         # suffixed by a huge number so that the minimum distance of the last occurrence can only be to the preceding element
         difference.builder <- c((current.element.positions[1]-length(current.corpus.part)), current.element.positions, 2*length(current.corpus.part))

         for (j in 1:(length(difference.builder)-2)) { # loop over this vector
            position <- position+1 # increment the position at which to store the current minimal distance
            distances.nearest.neighbors[position] <- min(diff(difference.builder[j:(j+2)])) # determine the minimal distance and store it
         } # end of for: loop over this vector
      } # end of for: loop over all corpus parts
      numerator <- sum(1/distances.nearest.neighbors)/length(distances.nearest.neighbors) # Washtell, p.c. 7/29/2008
   }
   values[["Washtell's Self Dispersion"]] <- numerator/denominator

The following is what I have left of the function to only return the range and DP:

dispersions1 <- function(corpus, corpus.part.sizes, element) {

    if(sum(corpus.part.sizes)!=length(corpus)) stop("The sum of the sizes of the corpus parts is not identical to the corpus size!")
    corpus.parts <- rep(1:length(corpus.part.sizes), corpus.part.sizes)

    n <- length(corpus.part.sizes) # n
    l <- length(corpus) # l
    f <- sum(corpus==element) # f
    if(f==0) { return("NA"); break() }
    s <- expected.proportions <- corpus.part.sizes/l # s

   
    v <- tapply(corpus==element, corpus.parts, sum) # v

    values <- list()   
    values[["element"]] <- element
    values[["range"]] <- sum(v>0)
    values[["Deviation of proportions DP"]] <- sum(abs((v/f)-s))/2 # Gries

    return(values)
}

Do you see a way to make this line faster?:


    v <- tapply(corpus==element, corpus.parts, sum) # v

With 11,000+ files (or more specifically, 11,000+ elements in the "corpus.parts" vector) this tapply() line takes up to 8 second per word, bigram, and trigram. With tens of thousands of words, bigrams, and trigrams, it will take a long time.

Stefan Th. Gries

unread,
Aug 17, 2012, 7:09:08 PM8/17/12
to corplin...@googlegroups.com
> Do you see a way to make this line faster?:
> v <- tapply(corpus==element, corpus.parts, sum) # v
Hm, nothing of the top of my head. Maybe this is a case where parallel
computing type of stuff (that I know very little about) might be
useful. Things like this maybe:
<http://www.inside-r.org/packages/cran/multicore/docs/mclapply>

Earl Brown

unread,
Aug 18, 2012, 2:29:39 AM8/18/12
to corplin...@googlegroups.com
It is curious to see that table() is (slightly) faster than tapply(..., sum):

> num.files <- 11650
> num.words <- 22000000
> aa <- sample(c(TRUE, FALSE), num.words, replace=T)
> bb <- rep(1:num.files, length.out=num.words)
>
> system.time({
+     cc <- tapply(aa, bb, sum)
+ })
   user  system elapsed
 13.235   0.878  14.242
>
> system.time({
+     dd <- table(aa, bb)[2,]
+ })
   user  system elapsed
 12.471   0.572  12.934

I found this post about making tapply() more efficient:

https://stat.ethz.ch/pipermail/r-help/2009-March/191049.html

but I couldn't make xtabs(~ aa + bb, sparse=T) quicker than either table() or tapply(..., sum). I also couldn't make the mclapply from the package "multicore" work. I'll likely plow ahead with table().

Stefan Th. Gries

unread,
Aug 18, 2012, 10:11:29 AM8/18/12
to corplin...@googlegroups.com
Nice, an improvement of 9% or so is nothing to sneeze at given the
size of the data. I changed the function on the website to use
table(...)[2,]!
Thanks,

Earl Brown

unread,
Sep 23, 2012, 1:19:29 AM9/23/12
to corplin...@googlegroups.com
We have a new and very clear winner in the benchmarking tests for making dispersions1() quicker: rowsum()

> num.files <- 10000
> num.words <- 10000000

> aa <- sample(c(TRUE, FALSE), num.words, replace=T)
> bb <- rep(1:num.files, length.out=num.words)
>
> system.time({
+     cc<-tapply(aa, bb, sum)

+ })
   user  system elapsed
  9.770   0.253   9.952
>
> system.time({
+     dd<-table(aa, bb)[2,]

+ })
   user  system elapsed
  9.083   0.269   9.298
>
> system.time({
+     ee<-rowsum(as.numeric(aa), bb)

+ })
   user  system elapsed
  1.013   0.050   1.057
>

I saw Gabor's response to Stefan's response to a question on R-help about combining two frequency lists. Gabor suggested using rowsum() and I looked at the doc for that function and noticed that it calls tapply(..., sum) a "slower version" of rowsum(). I updated my local (abbreviated) version of dispersions1() and ran some benchmarking tests and was able to get through a subset of my large dataset over fives times faster with rowsum() than with table()[2,]. The value that rowsum() returns is of class "matrix", but that doesn't seem to affect the calculations performed on the variable "v" to get the range and DP within dispersions1(). I only wish I had discovered sooner, but hey, now I (and we) know.

Stefan Th. Gries

unread,
Sep 23, 2012, 10:46:39 AM9/23/12
to corplin...@googlegroups.com
Thanks Earl, for testing and letting us know. It's interesting because
I really did not expect that to make such a difference. Just a few
days ago I ran a script that has the all words, all tags, and all
lemmas of the BNC XML in three vectors and where I do logical tests
such as words=="happy" and then some stuff with the resulting logical
vectors, and in all these lines, it's always the logical expression
that costs by far the most time so I never thought that it could be
sped up like that. Well, goes to show what i know ... :-)

Thanks again, I will upload new versions today,

Earl Brown

unread,
Mar 20, 2013, 11:50:00 PM3/20/13
to corplin...@googlegroups.com
I'm probably splitting hairs here, but this info may help someone who has a huge dataset to save time. I took what I learned in another thread about data.table() (thanks Alex) and applied it dispersions1() to see if it could be quicker. In short, yes, it makes dispersions1() even quicker, but only with huge datasets (which is precisely when you'd want every (milli)second you can get). With non-huge datasets, it's usually a toss up between data.table() and rowsum(), or rowsum() is slightly quicker. Below is a benchmarking test (using the package "rbenchmark") I did with a simulated dataset of 20 million words in 20 thousand files. While rowsum() took 23 seconds to get the dispersions ten times, data.table() took only 14 seconds, or 62% of the time.

> num.files <- 20000
> num.words <- 20000000

>
> aa <- sample(c(TRUE, FALSE), num.words, replace=T)
> bb <- rep(1:num.files, length.out=num.words)
>
> ben.tapply <- function(aa, bb) tapply(aa, bb, sum)
> ben.table <- function(aa, bb) table(aa, bb)[2,]
> ben.rowsum <- function(aa, bb) rowsum(as.numeric(aa), bb)
> ben.data.table <- function(aa, bb) {
+     dt <- data.table(as.numeric(aa), bb)
+     dt[, sum(V1), by = bb][,V1]
+ }
>
> benchmark(ben.tapply(aa, bb), ben.table(aa, bb), ben.rowsum(aa, bb), ben.data.table(aa, bb), replications = 10, columns = c("test", "replications", "elapsed"))
                    test replications elapsed
4 ben.data.table(aa, bb)           10  14.442
3     ben.rowsum(aa, bb)           10  23.205
2      ben.table(aa, bb)           10 131.808
1     ben.tapply(aa, bb)           10 127.688
>
> one <- ben.data.table(aa, bb)
> two <- ben.rowsum(aa, bb)
> three <- ben.table(aa, bb)
> four <- ben.tapply(aa, bb)
>
> identical(as.integer(one), as.integer(two))
[1] TRUE
> identical(as.integer(two), as.integer(three))
[1] TRUE
> identical(as.integer(three), as.integer(four))
[1] TRUE


With only 2 million words in 2 thousand files, rowsum() is slightly quicker than data.table():
                    test replications elapsed
4 ben.data.table(aa, bb)           10   1.371
3     ben.rowsum(aa, bb)           10   1.016
2      ben.table(aa, bb)           10  12.036
1     ben.tapply(aa, bb)           10  12.408

Anyway, hopefully this info will help someone to save time someday when they are working with a huge dataset.

Stefan Th. Gries

unread,
Mar 20, 2013, 11:52:39 PM3/20/13
to corplin...@googlegroups.com
Cool, thanks for testing!

Earl Brown

unread,
Jul 9, 2015, 1:49:39 AM7/9/15
to corplin...@googlegroups.com
Now I know for sure that I'm splitting hairs, but this info may be useful to someone, somewhere, so here goes.

While reading about Rcpp in Hadley Wickham's Advanced R book, I decided to try my hand at some light C++ in order to speed up a script that took several days to run when I ran it a few years ago on a huge corpus with tens of thousands of files. In the script, I use a simplified version of Stefan's dispersions1 function:


which can be seen in my script here:


With two simple C++ functions:

library("Rcpp")
cppFunction('
  int num_matches(CharacterVector corpus, String word) {
    int n = corpus.size();
    int cur_num = 0;
    String cur_wd;
    for (int i = 0; i < n; i++) {
      cur_wd = corpus[i];
        if (cur_wd == word) {
          cur_num++;
        }
      }
      return cur_num;
    }
')

cppFunction('
  IntegerVector match_or_not(CharacterVector corpus, String word) {
    int n = corpus.size();
    String cur_wd;
    IntegerVector output(n);
    for (int i = 0; i < n; i++) {
      cur_wd = corpus[i];
      if (cur_wd == word) {
        output[i] = 1;
      } else {
        output[i] = 0;
      }
    }
    return output;
  }
')

to replace the following (commented) lines within dispersions1():

# f <- sum(corpus==element)
f <- num_matches(corpus, element)

# v <- rowsum(as.numeric(corpus==element), corpus.parts)
v <- rowsum(match_or_not(corpus, element), corpus.parts)

I got a 2x increase in speed (after the first run, which compiles the two C++ functions). 

In addition to these C++ functions, within my script I also replaced scan() with readr::read_lines() and tolower() with stringr::str_to_lower(), both of which are more than twice as fast (on my computer), in order to speed up the loading of the text files in my corpus. 

I imagine that even more speed can be squeezed out of the script (if need be) by replacing grep() and gsub() with their stringr equivalents, str_subset() and str_replace_all(), respectively, as they are more than twice as fast.

Anyway, I thought I'd pass along this tidbit of info for those who might need a bit of a speed up with dispersions1() and/or with a quicker way to load (many) text files.

Stefan Th. Gries

unread,
Jul 9, 2015, 9:19:07 AM7/9/15
to CorpLing with R

Cool, thanks a lot!


STG
--
Stefan Th. Gries
----------------------------------

Univ. of California, Santa Barbara
http://tinyurl.com/stgries
----------------------------------

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.

To post to this group, send email to corplin...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages