DP (Gries, 2008, 2010)


josephsorell

Apr 28, 2013, 10:57:11 PM
to statforli...@googlegroups.com
gReetings!

Please excuse the rather long post and the sometimes tedious comments in my script. I have included my complete script for reference.

I'm analyzing word frequencies in multiple text files. My question has to do with the last two sections in which I calculate range (the number of texts in which a particular word type occurs) and Gries' DP. I'm using the simple, non-normalized version of DP, since I'm analyzing hundreds or thousands of files and each is exactly the same length (10,000 tokens).

Running this script on 705 files (7.05 million tokens) in a 64-bit version of R (2.15.1) on a computer with 8 GB of RAM and an Intel i5 3.40 GHz CPU took approximately 5 hours. When I increased that to a little more than 18 million tokens (1,842 files) it took around three days to finish. My hope is to also run the script on a collection of nearly 50 million. I fear that will take a month to finish.

My question is whether there is a more efficient way to calculate DP than the for loop near the end of the script. I looked at Stefan Gries' script at http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/_dispersion1.r but couldn't see any way to improve mine (at least none that was obvious to me).

Many thanks for any suggestions!

Joseph


#MultiFile_frequency_range_DP.r
#
library(gtools) #The mixedsort function that is used repeatedly in this script is from the Warnes’ (2012) gtools library.
selected.files <- list.files(path=getwd(), pattern="\\.txt$") #Selects all the .txt files from the working directory. The pattern is a regular expression, not a wildcard.
selected.files <- mixedsort(selected.files) #Reorders the files in their proper numeric order.
combined.list = NULL #Initiates the vector to receive the combined list.
length(combined.list) <- 10000000  #preallocates memory space to the vector so that a new space does not need to be found as it is enlarged with each iteration of the loop below.
for (i in 1:length(selected.files)) {
  text.file<-scan(selected.files[[i]], what="char", sep="\n", quote="", comment.char="") #Inputs the text file.
  text.file<-tolower(text.file) #Changes all alphabetic characters to lower case.
  text.file<-gsub("<.*?>", "", text.file, perl=T) #Removes tags.
  word.list<-strsplit(text.file, "\\W+") #Extracts words from the file.
  word.vector<-unlist(word.list) #Changes output from strsplit back into a vector.
  word.vector<-word.vector[nchar(word.vector)>0] #Removes any remaining empty strings.
  freq.list<-table(word.vector) #Creates a table of named integers (word types and their frequencies).
  assign(paste("freq.list",i,sep="."),freq.list) #Creates a unique file name for each iteration of this loop.
  combined.list<-c(combined.list, freq.list) #Adds each frequency list to combined list.
}
combined.freq <- as.table(tapply(combined.list, names(combined.list), sum)) #Transforms the series of individual frequency lists into a unified frequency list.
combined.sorted.freq.list<-sort(combined.freq, decreasing=T) #Sorts the frequency list in descending order of frequency.
all.freq.lists <- ls(pattern="freq.list.\\d{1,4}") #Lists all frequency files in memory with 1-4 digits. Actually, it will list all files with more than that many digits, too.
all.freq.lists <- mixedsort(all.freq.lists) #Reorders the lists numerically.
for (j in 1:length(all.freq.lists)) { #Loop sorts each individual frequency list into the combined order.
  freq.list.j<-get(all.freq.lists[[j]])
  j.in.combined<-freq.list.j[names(combined.sorted.freq.list)] #Looks up every word of the combined list in this file's frequency list; words that do not occur in the file come back as NA.
  j.in.combined[is.na(j.in.combined)] <- 0 #Changes "NA" to numeric 0.
  assign(paste("combined", j, sep="."), j.in.combined) #Writes the current frequency list to file.
}
all.j.in.combined<-ls(pattern="combined.\\d{1,4}") #Lists all combined files in memory with 1-4 digits.
all.j.in.combined <- mixedsort(all.j.in.combined)
combined.file.count<-1:length(selected.files) #Counts the number of files originally selected.
combined.file.list<-paste("combined", combined.file.count, sep=".") #Creates the file names for the combined lists by catenating "combined" with each file number separated by a period.
combined.table<-paste(names(combined.sorted.freq.list), combined.sorted.freq.list, sep="\t") #Creates a table with columns for the words of the combined vocabulary and the total frequency.
for (x in 1:length(combined.file.list)) {
  add.to.table<-get(combined.file.list[[x]])
  combined.table<-paste(combined.table, add.to.table, sep="\t") #Adds each of the component lists to the combined table.
}
header<-paste(selected.files, sep="\t") #Creates header labels for the component frequency lists.
table.header <- c("Word_type", "Total_Frequency", header, "\n") #Adds column labels for the Word_type and total columns. The hard return "\n" at the end will force the first row of the frequency table onto the second line of the file.
cat(table.header, file="word_frequencies.csv", sep="\t") #Saves the header row to a spreadsheet file.
cat(combined.table, file="word_frequencies.csv", sep="\n", append=TRUE) #Adds the table to the spreadsheet file.
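rm(list=ls(pattern="^(freq\\.list|combined)\\.\\d+$")) #Removes only the numbered temporary objects (freq.list.1, combined.1, ...). The looser pattern "combined.\\d*" would also match combined.sorted.freq.list and combined.file.count, which are still needed below.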
#
#vocabulary range
#
vocab.table <- read.csv("word_frequencies.csv", header=TRUE, sep="\t")
vocab.range = NULL #Initiates the vector to receive the range scores. Range, here, means the number of files in which a word type occurs.
length(vocab.range) <- 900000  #Preallocates memory space to the vector so that a new space does not need to be found as it is enlarged with each iteration of the loop below.
v <- length(names(combined.sorted.freq.list))  #Calculates the number of word types in the combined list
f <- length(combined.file.count)
x <- length(combined.file.count)+2  #Calculates the column number of the final file, i.e. the number of files plus the word_type and total columns.
for (i in 1:v) {
  range.count <- sum(vocab.table[i,3:x] >0)  #Counts number of files in which each type occurs, i.e. frequency is >0.
  vocab.range <- paste(vocab.range, range.count, sep="\n")
}
cat("Range", file="vocab_range.csv")
cat(vocab.range, file="vocab_range.csv", sep="\n", append=TRUE)
#
range.table <-unlist(read.csv("vocab_range.csv", header=TRUE, sep="\t"))
range_freq.table <- paste(names(combined.sorted.freq.list), combined.sorted.freq.list, range.table, sep="\t")
range.freq.table.header <- c("Word_type", "Total_Frequency", "Range", "\n") #Adds column labels for the Word_type and total columns. The hard return "\n" at the end will force the first row of the frequency table onto the second line of the file.
cat(range.freq.table.header, file="word_freq_range.csv", sep="\t")
cat(range_freq.table, file="word_freq_range.csv", sep="\n", append=TRUE)
#
#DP (Gries, 2008, 2010)
#Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. /International Journal of Corpus Linguistics/ 13(4). 403-437.
#Gries, Stefan Th. 2010. Dispersions and adjusted frequencies in corpora: further explorations. In Stefan Th. Gries, Stefanie Wulff, & Mark Davies (eds.), /Corpus linguistic applications: current studies, new directions/, 197-212. Amsterdam: Rodopi.
#
vocab.DP = NULL  #Initiates the vector to receive the DP scores.
length(vocab.DP) <- 1000000  #Preallocates memory space to the vector so that a new space does not need to be found as it is enlarged with each iteration of the loop below.
for (j in 1:v) {
  DP.score <- sum(abs((1/f)-vocab.table[j,3:x]/vocab.table[j,2]))/2
  vocab.DP <- paste(vocab.DP, DP.score, sep="\n")
}
cat("DP", file="vocab_DP.csv")
cat(vocab.DP, file="vocab_DP.csv", sep="\n", append=TRUE)
#
DP.table <-unlist(read.csv("vocab_DP.csv", header=TRUE, sep="\t"))
range_freq.DP.table <- paste(names(combined.sorted.freq.list), combined.sorted.freq.list, range.table, DP.table, sep="\t") #Creates a file with the combined list of word types present in the scanned files, each word's total frequency, its range and DP score.
range.freq.DP.table.header <- c("Word_type", "Total_Frequency", "Range", "DP", "\n") #Adds column labels for the Word_type and total columns. The hard return "\n" at the end will force the first row of the frequency table onto the second line of the file.
cat(range.freq.DP.table.header, file="word_freq_range_DP.csv", sep="\t")
cat(range_freq.DP.table, file="word_freq_range_DP.csv", sep="\n", append=TRUE)
rm(list=ls(all=T)) #Clears the memory so files do not interfere with subsequent processes.
#
#End of script
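
For reference, the two loops above (range and DP) can be expressed as whole-matrix operations. What follows is a minimal sketch, not a tested replacement for the script. It assumes the per-file counts are held in a numeric matrix with one row per word type and one column per file, and that all files have the same length, as stated in the post. The object name counts and the toy data are hypothetical; with the script's own table, something like as.matrix(vocab.table[, 3:x]) would play that role. Indexing a data frame one row at a time and growing a string with paste() are both costly in R, so that is likely where much of the running time goes.

```r
# Toy count table (hypothetical data): rows = word types, columns = files.
counts <- matrix(c(3, 3, 3,
                   9, 0, 0,
                   2, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("the", "vector", "is"),
                                 c("file1", "file2", "file3")))

n.files <- ncol(counts)     # plays the role of f in the script
totals  <- rowSums(counts)  # total frequency of each word type

# Range: number of files in which each type occurs at least once.
vocab.range <- rowSums(counts > 0)

# DP: half the summed absolute differences between the expected proportion
# (1/n.files, because all files have the same length) and the observed one.
# Dividing a matrix by a vector of length nrow divides each row by its total.
vocab.DP <- rowSums(abs(1 / n.files - counts / totals)) / 2

result <- data.frame(Word_type = rownames(counts),
                     Total_Frequency = totals,
                     Range = vocab.range,
                     DP = vocab.DP)
print(result)
```

In the toy data, "the" is spread perfectly evenly and gets a DP of 0, while "vector" occurs in one file only and gets 2/3. The result could then be written out in one step with write.table(result, file="word_freq_range_DP.csv", sep="\t", quote=FALSE, row.names=FALSE).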

Stefan Th. Gries

Apr 29, 2013, 8:42:27 PM
to statforli...@googlegroups.com
Assuming for the sake of the question that you're working on English,
how often are you computing DP for the word _the_?

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

aleksander dietrichson

Apr 29, 2013, 10:04:24 PM
to statforli...@googlegroups.com
Joseph,
Regarding your analysis that you fear will take a month to finish: I have been very successful running RStudio Server edition on Amazon. Get a big machine; there is one with Ubuntu 12.10 for cluster instances which permits you to have 240 GB of RAM and 32 cores (hyperthreaded). Things that take hours on my local hardware take minutes if you run them multi-core. If you use spot instances, the machine mentioned is usually only about $0.35 an hour.
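
As an illustration of what running it multi-core could look like in R, here is a minimal sketch using the parallel package that ships with R. It is an assumption about one possible setup, not a description of the configuration mentioned above. The toy matrix counts, the helper dp.block and the choice of three chunks and two cores are all hypothetical.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical toy counts: 6 word types in 4 equally sized files.
set.seed(1)
counts <- matrix(rpois(24, lambda = 2) + 1, nrow = 6,
                 dimnames = list(paste0("w", 1:6), paste0("file", 1:4)))

# DP for a block of rows, using the same formula as the script.
dp.block <- function(block) {
  rowSums(abs(1 / ncol(block) - block / rowSums(block))) / 2
}

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the row indices into contiguous chunks, one list element per chunk.
chunks <- split(seq_len(nrow(counts)),
                cut(seq_len(nrow(counts)), breaks = 3, labels = FALSE))

# Compute DP chunk by chunk, possibly on several cores at once.
dp.parts <- mclapply(chunks,
                     function(idx) dp.block(counts[idx, , drop = FALSE]),
                     mc.cores = n.cores)

# Reassemble the per-chunk results in the original row order.
vocab.DP <- unlist(dp.parts, use.names = FALSE)
names(vocab.DP) <- rownames(counts)
```

Whether this pays off depends on the data: each worker needs its own share of memory, and a vectorized single-core computation may already be fast enough.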



--
----------------------------------------
Aleksander Dietrichson, PhD
Founder and CIO
UVCMS SRL
Buenos Aires: +54(11) 4831-0385/9
New York: +1(646) 775-2914 x104

josephsorell

Apr 29, 2013, 10:47:07 PM
to statforli...@googlegroups.com
Yes, it's English. Sorry, I should've mentioned the language.

For 25 million, the total frequency of "the" in this text type should be about 1.6 million. The number of individual calculations (differences from the expected frequency) would be 2,500 for every word type.

Stefan Th. Gries

Apr 29, 2013, 11:15:28 PM
to statforli...@googlegroups.com
> Yes, it's English. Sorry, I should've mentioned the language.
No problem, I just needed an easy word type for an example.

> For 25 million, the total frequency of "the" in this text type should be about 1.6 million. The number of individual calculations (differences from the expected frequency) would be 2,500 for every word type.
So, does your script, which I didn't study, compute a DP value for
"the" 1.6 million times or 1 time? I.e., does it march through the
corpus doing a DP computation for each token or does it do a DP
computation for each type as in the following artificial example?

##########
rm(list=ls(all=TRUE)); set.seed(1)
corpus.tokens <- c("the", "vector", "is", "the", "whatever")
corpus.types <- unique(corpus.tokens)
DPs <- rep(0, length(corpus.tokens)); names(DPs) <- corpus.tokens

# compute DPs in whatever way
for (eachtype in corpus.types) {
   # creating a random DP-value here - you would of course do it with the corpus
   DPs[names(DPs)==eachtype] <- runif(1, 0, 1)
}
##########

This way you compute each type's DP value only once, saving probably
tens of millions of computations. This is not to say that the big data
approach isn't better or more powerful - which it is - but if you
don't already do a type-based calculation, then this will shave off
hours, too. If you already do a type-based calculation of this sort,
just disregard this message; writing this was faster than reading the
script ;-)

HTH,

josephsorell

Apr 29, 2013, 11:40:47 PM
to statforli...@googlegroups.com


On Tuesday, April 30, 2013 11:15:28 AM UTC+8, Stefan Th. Gries wrote:
> > Yes, it's English. Sorry, I should've mentioned the language.
> No problem, I just needed an easy word type for an example.
>
> > For 25 million, the total frequency of "the" in this text type should be about 1.6 million. The number of individual calculations (differences from the expected frequency) would be 2,500 for every word type.
> So, does your script, which I didn't study, compute a DP value for
> "the" 1.6 million times or 1 time? I.e., does it march through the
> corpus doing a DP computation for each token or does it do a DP
> computation for each type as in the following artificial example?

My script calculates DP from a frequency table created by the first part of the script. (That part was developed from what I learned in your Quantitative Corpus Linguistics with R.)
f is the number of individual files
The variable j is each word type. The script marches through the table calculating DP for each word type.
It starts with column 3 in the vocabulary table (since col 1 is the word type and col 2 is the total frequency) and goes to x (number of files +2 to give the correct column number for the last file.)
It then adds each DP score to a table that will be saved as a csv file.

#
vocab.DP = NULL  #Initiates the vector to receive the DP scores.
length(vocab.DP) <- 1000000  #Preallocates memory space to the vector so that a new space does not need to be found as it is enlarged with each iteration of the loop below.
for (j in 1:v) {
  DP.score <- sum(abs((1/f)-vocab.table[j,3:x]/vocab.table[j,2]))/2
  vocab.DP <- paste(vocab.DP, DP.score, sep="\n")
}
cat("DP", file="vocab_DP.csv")
cat(vocab.DP, file="vocab_DP.csv", sep="\n", append=TRUE)
#######
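
One further observation on this loop, offered as a sketch under assumptions rather than as part of the original script. Word types typically occur in only a fraction of the files, and every file in which a type does not occur contributes exactly 1/f to the sum. DP can therefore be computed from the nonzero counts alone, without ever building the full word-by-file table. The example assumes the counts are stored in long format, one row per (word type, file) pair that actually occurs. The data frame long and its column names are hypothetical.

```r
# Hypothetical long-format counts: zero counts are simply absent.
long <- data.frame(
  word  = c("the", "the", "the", "vector", "is", "is"),
  file  = c(1, 2, 3, 1, 1, 2),
  count = c(3, 3, 3, 9, 2, 1),
  stringsAsFactors = FALSE
)
n.files <- 3  # f in the script

by.word <- split(long$count, long$word)
totals      <- vapply(by.word, sum, numeric(1))     # total frequency per type
vocab.range <- vapply(by.word, length, integer(1))  # number of files containing the type

# Deviations for the files in which the type occurs ...
dev <- abs(1 / n.files - long$count / totals[long$word])
nonzero.part <- vapply(split(dev, long$word), sum, numeric(1))
# ... plus 1/n.files for every file in which it does not occur.
zero.part <- (n.files - vocab.range) / n.files

vocab.DP <- (nonzero.part + zero.part) / 2
print(vocab.DP)
```

The three vectors share the same (alphabetical) order of word types because they are all produced by split() on the same grouping column. The range comes out as a by-product, since it is just the number of rows per type.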

Stefan Th. Gries

Apr 29, 2013, 11:42:35 PM
to statforli...@googlegroups.com
ok, sorry, then my comments were not pertinent. Then, maybe back to
Aleksander's proposals ;-)

josephsorell

Apr 30, 2013, 2:17:40 AM
to statforli...@googlegroups.com
The online R server sounds promising if I can figure out how to set it up.
Thanks!

josephsorell

Apr 30, 2013, 4:49:23 AM
to statforli...@googlegroups.com, sa...@uvcms.com
This seems like the way to go. Thanks! 
Would you upload the data to the cloud before processing?

