Frequency numbers of verbs only


Earl Brown

Mar 14, 2013, 11:27:03 PM
to corplin...@googlegroups.com
Hello Corpus Linguistics Rists. I'm having trouble retrieving frequency numbers of words from one data frame and putting them into another data frame. Here's a simple example:

# frequency list data frame
freq.list <- data.frame(WORD = c("address", "address", "record", "record"), POS = c("noun", "verb", "noun", "verb"), FREQ = c(12, 23, 34, 45))

     WORD  POS FREQ
1 address noun   12
2 address verb   23
3  record noun   34
4  record verb   45

# tokens data frame
tokens <- data.frame(CASE = c(1,2,3,4,5,6), TOKEN = c("record", "junk", "address", "more junk", "record", "even more junk"))

  CASE          TOKEN
1    1         record
2    2           junk
3    3        address
4    4      more junk
5    5         record
6    6 even more junk

# I'd like to get the freq numbers for verbs, but this line returns those of nouns
# (match() only returns the position of the first matching row, which is the noun row here)
tokens$FREQ <- freq.list$FREQ[match(tokens$TOKEN, freq.list$WORD)]

So, my question for you is how to retrieve the frequency numbers of the verbs. I tried this:
freq.list$FREQ[(match(tokens$TOKEN, freq.list$WORD) && freq.list$POS %in% "verb")]

but that didn't work. A workaround is to subset the frequency list data frame before running the line with match():
sub.freq.list <- freq.list[freq.list$POS %in% "verb",]
tokens$FREQ <- sub.freq.list$FREQ[match(tokens$TOKEN, sub.freq.list$WORD)]

but I'm wondering if there is a way to specify both conditions ((1) tokens$TOKEN == freq.list$WORD and (2) freq.list$POS %in% "verb") in the call to match()?

Thanks for your help. Earl Brown
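
One way to fold both conditions into a single match() call is to match on a combined word-plus-tag key. This is only a minimal sketch on the toy data above: the names key.tokens and key.freq and the tab separator are illustrative assumptions, and it presumes the freq list has at most one row per word/POS pair, since match() only ever returns the first hit. The `&&` attempt cannot work in any case, because `&&` is the scalar operator and the two sides have different lengths.

```r
# Toy data from the question, with strings kept as character vectors.
freq.list <- data.frame(
    WORD = c("address", "address", "record", "record"),
    POS  = c("noun", "verb", "noun", "verb"),
    FREQ = c(12, 23, 34, 45),
    stringsAsFactors = FALSE)
tokens <- data.frame(
    CASE  = 1:6,
    TOKEN = c("record", "junk", "address", "more junk", "record", "even more junk"),
    stringsAsFactors = FALSE)

# Build one composite key per row: word + tab + part of speech.
# (The tab separator is an assumption; any string that cannot occur
# inside a word or a tag would do.)
key.tokens <- paste(tokens$TOKEN, "verb", sep = "\t")
key.freq   <- paste(freq.list$WORD, freq.list$POS, sep = "\t")

# A single match() call now honours both conditions at once.
tokens$FREQ <- freq.list$FREQ[match(key.tokens, key.freq)]
```

Tokens that have no verb entry in the freq list come back as NA.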

Alex Perrone

Mar 15, 2013, 12:35:06 AM
to corplin...@googlegroups.com
probably a hack, but…

getFreq <- function(token, pos, freq.list) {
    ind <- which(freq.list$WORD == token & freq.list$POS == pos)
    if (length(ind) == 1) {
        return(freq.list$FREQ[ind])
    } else {
        return(NA)
    }
}

# Test cases
getFreq("record","noun",freq.list)
getFreq("record","verb",freq.list)
getFreq("junk","verb",freq.list)

# Make new columns in tokens
# as.character(tokens$TOKEN) needed to convert from factor to strings
tokens$FREQ.NOUN <- sapply(as.character(tokens$TOKEN), FUN=getFreq, "noun", freq.list)

tokens$FREQ.VERB <- sapply(as.character(tokens$TOKEN), FUN=getFreq, "verb", freq.list)




Earl Brown

Mar 15, 2013, 12:47:59 PM
to corplin...@googlegroups.com
Hack or not, that's a cool solution.

I modified it to handle (malformed) freq lists that have two or more rows with the same word and the same part of speech tag:


getFreq <- function(token, pos, freq.list) {
    ind <- which(freq.list$lemma == token & freq.list$PoS == pos)
    if (length(ind) > 1) {
        # more than one row for the same word and tag: the freq list is malformed
        stop("The freq list has more than one \"", token, "\" coded as \"", pos, "\"")
    } else if (length(ind) == 1) {
        return(freq.list$freq[ind])
    } else {
        return(NA)
    }
}

Thanks.

Stefan Th. Gries

Mar 15, 2013, 12:55:05 PM
to corplin...@googlegroups.com
This (and similar problems) benefits from a more general approach. It
would be nicer if, for (malformed) freq lists that have two or more
rows with the same word and the same part of speech tag, the function
returned all the multiple hits. This will require that the output is
not a vector (with one frequency for each token) but a list with as
many elements as there are tokens, where each list element contains
all the frequencies for that token (which had better be 1 most of the
time ;-)). I wrote scripts to do that to retrieve, for example,
pronunciations from the CELEX database. Most of the time, you get one
pronunciation (like you want one frequency), but when there
are two pronunciations, you want to know that, too.
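
A minimal sketch of that list-returning idea, on the toy freq list from the start of the thread. This is not the CELEX script mentioned above: the function name getAllFreqs is hypothetical, and the duplicated "record"/"verb" row is invented here purely to show what multiple hits look like.

```r
# Toy freq list plus one deliberately duplicated "record"/"verb" row
# (an invented row, added only to illustrate multiple hits).
freq.list <- data.frame(
    WORD = c("address", "address", "record", "record", "record"),
    POS  = c("noun", "verb", "noun", "verb", "verb"),
    FREQ = c(12, 23, 34, 45, 46),
    stringsAsFactors = FALSE)

# Hypothetical helper: returns a list with one element per token,
# each element holding every frequency found for that token and tag.
getAllFreqs <- function(tokens, pos, freq.list) {
    sub <- freq.list[freq.list$POS == pos, ]   # restrict to the tag once
    hits <- lapply(tokens, function(tok) sub$FREQ[sub$WORD == tok])
    names(hits) <- tokens
    hits
}

verb.hits <- getAllFreqs(c("record", "junk", "address"), "verb", freq.list)

# Number of hits per token: anything other than 1 deserves a closer look.
n.hits <- lengths(verb.hits)
```

Checking n.hits afterwards shows which tokens had no hit (0) or several hits (more than 1), without stopping the whole run.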

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Earl Brown

Mar 16, 2013, 12:07:37 AM
to corplin...@googlegroups.com
As a (benchmarking) side note:

With 10,000 tokens and a freq list of 20,000 it was taking 45 seconds to get through my data with sapply(). I changed it to unlist(mclapply()) from library("parallel") and reduced the time by almost half (as my machine has two cores; with four cores, for example, the time would presumably be a fourth, etc.):

> system.time({
+ aa <- sapply(as.character(imp$LEMMA), FUN = getFreq, "v", dav.list)
+ })
   user  system elapsed
 43.455   1.326  44.524
>
> library("parallel")
> system.time({
+ bb <- unlist(mclapply(as.character(imp$LEMMA), FUN = getFreq, "v", dav.list))
+ })
   user  system elapsed
 53.430   1.654  27.759
>
> identical(as.integer(aa), as.integer(bb))
[1] TRUE
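
A hedged side note on the core count: mclapply() takes its number of workers from mc.cores, which defaults to getOption("mc.cores", 2L), so on a machine with more than two cores the extra ones are only used if mc.cores is set explicitly. The sketch below uses a toy function as a stand-in for getFreq; n.cores is an illustrative name, and the Windows branch is there because forking is not available on that platform.

```r
library("parallel")

# Use every core that detectCores() reports; fall back to 1 on Windows
# (no forking there) or if the core count cannot be determined.
n.cores <- if (.Platform$OS.type == "windows") {
    1L
} else {
    max(1L, detectCores(), na.rm = TRUE)
}

# Toy stand-in for the real lookup function.
squares <- unlist(mclapply(1:8, function(i) i^2, mc.cores = n.cores))
```

Whether the elapsed time really scales down with the number of cores depends on the overhead of forking and collecting results, so it is worth timing on the actual data.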

Alex Perrone

Mar 16, 2013, 12:12:14 AM
to corplin...@googlegroups.com
I think the data frame lookups are slow (comparing all the elements to see if they are equal to your token, over and over). I think if you use a data.table (package data.table) speed would not be a problem.
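
For comparison, a base-R sketch that avoids the repeated scan without any extra package: build the verb lookup once as a named vector, then index it with all tokens in a single vectorized step. The names verbs and verb.freq are illustrative, and the sketch assumes each word occurs at most once per tag in the freq list.

```r
# Toy data from the start of the thread.
freq.list <- data.frame(
    WORD = c("address", "address", "record", "record"),
    POS  = c("noun", "verb", "noun", "verb"),
    FREQ = c(12, 23, 34, 45),
    stringsAsFactors = FALSE)
tokens <- data.frame(
    CASE  = 1:6,
    TOKEN = c("record", "junk", "address", "more junk", "record", "even more junk"),
    stringsAsFactors = FALSE)

# Build the lookup once: frequencies named by their word.
verbs <- freq.list[freq.list$POS == "verb", ]
verb.freq <- setNames(verbs$FREQ, verbs$WORD)

# One vectorized lookup for all tokens; unknown names yield NA.
tokens$FREQ.VERB <- unname(verb.freq[tokens$TOKEN])
```

This is essentially the subset-then-match() workaround from the original question, so it is a baseline to time against rather than a replacement for data.table.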


Alex Perrone

Mar 16, 2013, 12:47:00 PM
to corplin...@googlegroups.com
Earl, 

In other words, try out this code. It is *much* more concise (each of your queries takes one line) and I suspect much faster when you scale it up. 

-- Alex

# Load libraries
library("data.table")

# frequency list data frame
freq <- data.frame(WORD = c("address", "address", "record", "record"),
                   POS = c("noun", "verb", "noun", "verb"),
                   FREQ = c(12, 23, 34, 45),
                   stringsAsFactors=FALSE)

# tokens data frame
tokens <- data.frame(CASE = c(1,2,3,4,5,6),
                     TOKEN = c("record", "junk", "address", "more junk", "record", "even more junk"),
                     stringsAsFactors=FALSE)

# as data.tables
freq.table <- data.table(freq)
tokens.table <- data.table(tokens)

# set keys, very important!! 
setkey(freq.table)    # sort table using all the columns in original order
setkey(tokens.table)

# Add columns to tokens.table 
tokens.table[,FREQ.NOUN:=freq.table[J(TOKEN,"noun")]$FREQ]
tokens.table[,FREQ.VERB:=freq.table[J(TOKEN,"verb")]$FREQ]

Earl Brown

Mar 17, 2013, 12:58:51 AM
to corplin...@googlegroups.com
Alex, I can get it to work on the simple example I gave, but not on my real data. I get an error message:

Error in `[.data.table`(freq.table, J(LEMMA, "v")) :
  typeof x.rank (integer) != typeof i.V1 (character)

Verbs are coded as "v" in the freq list column labeled "PoS".

Here's all my code, as well as str() of the tokens and freq data tables, at the bottom.

> rm(list=ls(all=T))
> library("data.table")
>
> # loads data
> tokens <- read.table("/Users/earlbrown/Documents/Imperfect_subjunctive_forms/Data_imp_sub_novels_Habla_Culta.csv", header = T, sep = "\t", quote = "", na.strings = "", stringsAsFactors = F)
>
> # loads freq list
> freq <- read.table("/Users/earlbrown/Sites/Dict_Span_Port/Freq_list_Span.txt", header=T, sep="\t", encoding="latin1", stringsAsFactors = F)
>
> # converts to data.table and sets keys
> tokens.table <- data.table(tokens)
> freq.table <- data.table(freq)
> setkey(tokens.table)
> setkey(freq.table)
>
> # gets frequencies
> tokens.table[,FREQ.LEM:=freq.table[J(LEMMA, "v")]$freq]   
Error in `[.data.table`(freq.table, J(LEMMA, "v")) :
  typeof x.rank (integer) != typeof i.V1 (character)
>
> str(tokens.table)
Classes ‘data.table’ and 'data.frame':    10311 obs. of  14 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ GENRE   : chr  "written" "written" "written" "written" ...
 $ COUN    : chr  "Venezuela" "Venezuela" "Venezuela" "Venezuela" ...
 $ FILE    : chr  "Barro" "Barro" "Barro" "Barro" ...
 $ WD.NUM  : int  38 138 338 611 667 684 944 1043 1243 1607 ...
 $ PRE     : chr  "un auténtico gorro negro y carrasposo encajado en la frente y me dijo como si" "mientras iba bajando la escalera y haciendo me el sordo. Es posible que no" "aunque mucho menos metido en el papel, más suavizado y más benigno como si" "silencio dónde se supone que va a entrar de inmediato mi respuesta ; pero así" ...
 $ HIT     : chr  "estuviéramos" "advirtieran" "fuera" "hubiera" ...
 $ FOL     : chr  "hablando de lo mismo desde hace rato : \" dan ganas de ir se a" "aquella escapada, aunque sé que la mujer de el pijama amarillo me vio cuando" "el propio Raymon_Burr en el papel de el Jefe ya retirado y arrepentido de todas" "tenido que esperar medio siglo, pues todo lo que hago en su favor," ...
 $ LEMMA   : chr  "estar" "advertir" "ser" "haber" ...
 $ PRE.WD  : chr  "si" "no" "si" "así" ...
 $ FOL.WD  : chr  "hablando" "aquella" "el" "tenido" ...
 $ TYPE    : chr  "ra" "ra" "ra" "ra" ...
 $ POL     : chr  "pos" "neg" "pos" "pos" ...
 $ COMPOUND: chr  "comp" "sim" "sim" "comp" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr  "CASE" "GENRE" "COUN" "FILE" ...
>
> str(freq.table)
Classes ‘data.table’ and 'data.frame':    19506 obs. of  4 variables:
 $ rank : int  1 2 3 4 5 6 7 8 9 10 ...
 $ freq : int  1173068 1146659 929858 579175 543262 516157 448409 398045 341858 341354 ...
 $ lemma: chr  "el" "de" "la" "que" ...
 $ PoS  : chr  "l" "e" "l" "c" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr  "rank" "freq" "lemma" "PoS"
>

Thanks for your help. data.table()-s look promising. Earl

Alex Perrone

Mar 17, 2013, 1:08:07 AM
to corplin...@googlegroups.com
Never seen that one before, I'm kind of new to data.tables myself, could try the data.table mailing list?


or elsewhere…

Alex Perrone

Mar 17, 2013, 1:27:57 AM
to corplin...@googlegroups.com
it appears your "v" is being queried on the rank column because that is the first (unnamed) column in the freq.table -- and the rank is a column of integers, not characters. i think you need to switch around the columns so they match up in both data.tables, so that when you query LEMMA it will look at lemma and "v" will look at PoS. it might also suffice to set the keys in the order you want using setkey or setkeyv and providing a character vector, but unsure.

just be careful not to do things like freq.table$PoS=="v" because that is a vector scan as opposed to a binary search, so you lose all speed benefits of a data.table. this is why the J syntax is used (their intro documentation vignette explains this well). 
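
A minimal sketch of the "set the keys in the order you want" idea, assuming the data.table package is installed. The four-row freq.table and tokens.table below are invented stand-ins with the same column names as the real files, and hits is an illustrative name. With the key set on the two character columns, J() is matched against lemma and PoS instead of the integer rank column.

```r
library("data.table")

# Invented stand-ins that mimic the column layout of the real tables.
freq.table <- data.table(
    rank  = 1:4,
    freq  = c(900L, 500L, 300L, 100L),
    lemma = c("ser", "estar", "ser", "haber"),
    PoS   = c("v", "v", "n", "v"))
tokens.table <- data.table(
    CASE  = 1:4,
    LEMMA = c("estar", "advertir", "ser", "haber"))

# Key on the character columns, in the order the query will use them.
setkeyv(freq.table, c("lemma", "PoS"))

# Keyed (binary-search) lookup: one result row per token, NA if no match.
# This assumes at most one row per lemma/PoS pair in the freq list;
# duplicates would return extra rows and break the one-to-one alignment.
hits <- freq.table[J(tokens.table$LEMMA, "v")]
tokens.table[, FREQ.LEM := hits$freq]
```

Lemmas without a verb entry, such as "advertir" in this toy data, come back as NA.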




Matt the Accountant

Mar 17, 2013, 6:08:18 PM
to corplin...@googlegroups.com
I use data.table with some frequency (and love it).  Alex is correct: you are attempting to query an integer column with a character value, and that is why you are getting the error.  As I understand and use it, J() is intended to match the given value against the table's key using binary search.  So, if you want to query your freq.table for rows where the "PoS" column is "v", you could:

setkey(freq.table, PoS)
freq.table[J("v")]

It looks like this modification (setting the key and querying it directly) would fix your issue, if Alex's sample query gives you what you want.


Matt

Earl Brown

Mar 18, 2013, 1:18:42 AM
to corplin...@googlegroups.com
Success! (on getting the data.table code to work) but another bug (with accented vowels and "ñ").

I used setkeyv() to set the keys on the character columns:

> # sets keys
> setkeyv(tokens.table, "LEMMA")
> setkeyv(freq.table, c("lemma", "PoS"))

and got Alex's data.table code to work correctly with my big data set and freq list. However, I noticed that the result of Alex's data.table() code:


> tokens.table[,FREQ.LEM:=freq.table[J(LEMMA, "v")]$freq]

didn't give me the exact same result as Alex's original sapply() code:
tokens$FREQ.LEM <- sapply(as.character(tokens$LEMMA), FUN = getFreq, "v", freq)

even after re-sorting the data.table into the original order with order().

I narrowed the problem down to accented vowels and "ñ", as in "oír" 'to hear' and "enseñar" 'to teach'. When the verb lemma had one of those letters, the data.table code returned "NA" while the sapply() code correctly returned the frequency number from the other data table. The accented vowels and the "ñ" didn't bother sapply() but caused the data.table approach to return "NA".

Any ideas on this? Could it be something with the inner workings of data.table that I / we cannot change (easily)? I assume data.table supports Unicode, but maybe not (?).

As always, thanks for your help. I appreciate your time. Earl

Earl Brown

Mar 19, 2013, 11:21:51 PM
to corplin...@googlegroups.com
Encoding issues rear their ugly heads again. The freq list was encoded as "latin1" while my tokens file was encoded as "UTF-8". Even using iconv() on the character columns in the freq list didn't fix the problem. I had to simply resave the freq list with "UTF-8" encoding using LibreOffice before pulling it into R with read.table(). A simple fix indeed.
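
A small base-R sketch of why the two files would not match, offered as an illustration rather than a diagnosis of the exact failure above (where iconv() reportedly did not help). The same word has different byte sequences in latin1 and UTF-8, so a lookup that compares bytes can miss it. The variable names are illustrative. It may also be relevant that, per the read.table() documentation, the `encoding` argument only marks strings as latin1 or UTF-8 without re-encoding them, whereas `fileEncoding` re-encodes the input while reading.

```r
# The verb "oir" (with accented i) as it would sit in a latin1 file:
# three bytes, 6f ed 72.
latin1.oir <- rawToChar(as.raw(c(0x6f, 0xed, 0x72)))
Encoding(latin1.oir) <- "latin1"

# The same word as a UTF-8 string (a \u escape always yields UTF-8):
# four bytes, 6f c3 ad 72.
utf8.oir <- "o\u00edr"

# The underlying bytes differ between the two encodings ...
bytes.differ <- !identical(charToRaw(latin1.oir), charToRaw(utf8.oir))

# ... and agree once the latin1 string is re-encoded to UTF-8.
converted <- iconv(latin1.oir, from = "latin1", to = "UTF-8")
bytes.match <- identical(charToRaw(converted), charToRaw(utf8.oir))
```

Calling Encoding() on the character columns of both tables is a quick way to see whether they carry different encoding marks before joining them.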

Getting back to benchmarking, yeah, data.table is (way (much)) faster. Compared to sapply() it's 1226 times quicker, and compared to mclapply() it's 857 times quicker, at least with my data file of 10,000 tokens and my freq list of 20,000 rows. Take a look:

> ###
> # option 1: sapply()
> system.time({
+     tokens$FREQ.LEM.SAPPLY <- sapply(as.character(tokens$LEMMA), FUN = getFreq, "v", freq)

+ })
   user  system elapsed
 24.452   1.494  25.747
>
> # option 2: mclapply()
> library("parallel")
> system.time({
+     tokens$FREQ.LEM.MCLAPPLY <- unlist(mclapply(as.character(tokens$LEMMA), FUN = getFreq, "v", freq))    

+ })
   user  system elapsed
 33.154   1.561  17.989
>
> # option 3: data.table
> library("data.table")
> system.time({
+     tokens.table <- data.table(tokens)
+     freq.table <- data.table(freq)
+     
+     # sets keys
+     setkeyv(tokens.table, "LEMMA")
+     setkeyv(freq.table, c("lemma", "PoS"))
+     
+     # gets frequencies
+     tokens.table[,FREQ.LEM:=freq.table[J(LEMMA, "v")]$freq]    
+     
+     # reorders to original order
+     order.index <- order(tokens.table$CASE)
+     tokens.table <- tokens.table[order.index,]

+ })
   user  system elapsed
  0.016   0.005   0.021
>
> identical(as.integer(tokens$FREQ.LEM.SAPPLY), as.integer(tokens$FREQ.LEM.MCLAPPLY))
[1] TRUE
> identical(as.integer(tokens$FREQ.LEM.MCLAPPLY), as.integer(tokens.table$FREQ.LEM))
[1] TRUE

Rik Vosters

Mar 21, 2013, 3:32:17 AM
to corplin...@googlegroups.com
Hi everyone,

Here is a small sample of what my data look like:

speaker = c("N0005", "N0012", "N0101", "N0014", "N0036", "N0014", "N0005", "N0005", "N0031")
a = c("N00059", "N00059", "N01019", "N00059", "N00181", "N00059", "N00059", "N00059", "N00206")
b = c("N00112", "N00120", "N01020", "N00143", "N00241", "N00147", "N00147", "N00149", "N00316")
c = c("NA", "NA", "N01021", "NA", "N00363", "NA", "NA", "NA", "N00318") 
df = data.frame(speaker, a, b, c); df

Factor 'speaker' shows the speaker codes containing only the first five characters. Normally, these speaker codes should be six characters, but the CGN corpus software seems to cut off the last digit. Factors a, b and c (actually several more, but just three for the example) show the participants in the conversation, without the last digit cut off -- i.e. the speaker is among these, with his/her full six-character speaker code.

I now want to see if the five-character string from 'speaker' matches the first five characters from a, b or c. If it matches just one string, I want that six-character string from a, b or c to be added to a new factor 'speaker2'. If it matches two or more strings (as in case 9, where "N0031" could match both "N00316" and "N00318"), I want it to add "ambiguous" to that new factor 'speaker2'.

In other words, I want to know how I can get this output:

speaker2 = c("N00059", "N00120", "N01019", "N00143", "N00363", "N00147", "N00059", "N00059", "ambiguous")
df_result = data.frame(df, speaker2); df_result

I tried this several ways with ifelse(), within(), pmatch() and gregexpr(), but I can never seem to get it to work entirely. There must be a pretty easy way to do this, but I can't seem to come up with it. Anyone's got any ideas? 

Best wishes,

Rik

---
Dr. Rik Vosters

Postdoctoraal onderzoeker
Nederlandse taalkunde
Vrije Universiteit Brussel

Doctor-assistent
Nederlands en algemene taalkunde
Erasmushogeschool Brussel

Rik.V...@vub.ac.be
http://homepages.vub.ac.be/~rvosters/
---


Stefan Th. Gries

Mar 21, 2013, 9:02:47 AM
to corplin...@googlegroups.com
unique.abc <- sort(unique(c(a,b,c))); unique.abc
speaker2 <- rep(NA, length(speaker))

for (i in seq_along(speaker)) {
    speaker2[i] <- paste(grep(substr(speaker[i],1,5), unique.abc,
                              value=TRUE), collapse="_")
}
# if you really want "ambiguous" and not the candidates:
speaker2[grepl("_", speaker2)] <- "ambiguous"

HTH,

Rik Vosters

Mar 21, 2013, 11:22:19 AM
to corplin...@googlegroups.com
Thanks, Stefan! I still had to modify the code slightly to make it look for possible candidates in the corresponding row of the a, b and c factors only (as opposed to in the a, b and c factors overall, with unique() ), but it works like a charm! 

speaker2[i] <- paste(grep(substr(speaker[i],1,5), c(a[i], b[i], c[i]),
                            value=TRUE), collapse="_")
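
Putting the row-wise variant together as one self-contained sketch, on the sample data from the question. The helper name resolveSpeaker is hypothetical, and startsWith() is swapped in for grep() so that the five-character code can only match at the beginning of a participant code; treat it as one possible formulation, not the code actually used.

```r
# Sample data from the question, kept as character columns.
df <- data.frame(
    speaker = c("N0005", "N0012", "N0101", "N0014", "N0036", "N0014", "N0005", "N0005", "N0031"),
    a = c("N00059", "N00059", "N01019", "N00059", "N00181", "N00059", "N00059", "N00059", "N00206"),
    b = c("N00112", "N00120", "N01020", "N00143", "N00241", "N00147", "N00147", "N00149", "N00316"),
    c = c("NA", "NA", "N01021", "NA", "N00363", "NA", "NA", "NA", "N00318"),
    stringsAsFactors = FALSE)

# Hypothetical helper: the unique full code if exactly one candidate starts
# with the truncated code, "ambiguous" if several do, NA if none does.
resolveSpeaker <- function(prefix, candidates) {
    hits <- unique(candidates[startsWith(candidates, prefix)])
    if (length(hits) == 1) {
        hits
    } else if (length(hits) > 1) {
        "ambiguous"
    } else {
        NA_character_
    }
}

# Apply the helper to each row, looking only at that row's participants.
df$speaker2 <- vapply(seq_len(nrow(df)), function(i) {
    resolveSpeaker(df$speaker[i], unlist(df[i, c("a", "b", "c")], use.names = FALSE))
}, character(1))
```

On the sample data this reproduces the speaker2 column asked for in the question, including "ambiguous" for row 9.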

Best,

Rik


Earl Brown

Mar 22, 2013, 11:47:57 AM
to corplin...@googlegroups.com
I asked a question on the data.table help list and learned that as.integer() is (slightly) quicker than as.numeric() with data tables in huge datasets. Take a look:

> library("data.table")
> library("rbenchmark")
>
> # generates random data
> num.files <- 20000
> num.words <- 20000000
> logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T)
> file.names <- rep(1:num.files, length.out=num.words)
>
> # defines functions
> benDTInt <- function(aa, bb) {
+     dt <- data.table(as.integer(aa), bb)
+     dt[,sum(V1), by = bb][,V1]
+ }
>
> benDTNum <- function(aa, bb) {
+     dt <- data.table(as.numeric(aa), bb)
+     dt[,sum(V1), by = bb][,V1]
+ }
>
> benRowsumInt <- function(aa, bb) rowsum(as.integer(aa), bb)
>
> benRowsumNum <- function(aa, bb) rowsum(as.numeric(aa), bb)
>
> benTapply <- function(aa, bb) tapply(aa, bb, sum)
>
> # runs benchmarking
> benchmark(benTapply(logical.vector, file.names), benRowsumInt(logical.vector, file.names), benRowsumNum(logical.vector, file.names), benDTInt(logical.vector, file.names), benDTNum(logical.vector, file.names), replications = 10, columns = c("test", "replications", "elapsed"))
                                      test replications elapsed
4     benDTInt(logical.vector, file.names)           10  12.653
5     benDTNum(logical.vector, file.names)           10  14.141
2 benRowsumInt(logical.vector, file.names)           10  21.133
3 benRowsumNum(logical.vector, file.names)           10  21.134
1    benTapply(logical.vector, file.names)           10 127.768
>
> # tests for sameness among results
> one <- benTapply(logical.vector, file.names)
> two <- benDTInt(logical.vector, file.names)
> three <- benDTNum(logical.vector, file.names)
> four <- benRowsumInt(logical.vector, file.names)
> five <- benRowsumNum(logical.vector, file.names)
> identical(as.integer(one), as.integer(two))
[1] TRUE
> identical(as.integer(two), as.integer(three))
[1] TRUE
> identical(as.integer(three), as.integer(four))
[1] TRUE
> identical(as.integer(four), as.integer(five))
[1] TRUE
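
For readers without the full data set, a scaled-down base-R sketch of what the benchmark computes: the number of TRUE values per file, once with tapply() and once with rowsum(). The sizes and the names flags and groups are invented for illustration, and no timings are claimed at this scale.

```r
set.seed(1)

# Scaled-down stand-ins for logical.vector and file.names.
flags  <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
groups <- rep(1:10, length.out = 1000)

# Count of TRUE per group, computed two ways.
by.tapply <- tapply(flags, groups, sum)
by.rowsum <- rowsum(as.integer(flags), groups)

# Both return the groups in sorted order, so the counts line up.
same.counts <- identical(as.integer(by.tapply), as.integer(by.rowsum))
```

rowsum() returns a one-column matrix and tapply() a named array, which is why the comparison above goes through as.integer(), as in the checks at the end of the transcript.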