Stacking lists of n-grams from several text files into a single data frame

Christophe Bechet
Apr 11, 2018, 7:31:18 PM
to CorpLing with R
Dear QuantCorpLingR users,

I'm having a hard time trying to create a stacked table of n-grams for several corpus files. The idea is to have one column with the IDs of the text files and another column with all the trigrams. As you may already have guessed, the aim is to use n-grams to create clusters of texts and see whether their classification is in line with their genre classification in the Corpus of Late Modern English Texts (texts produced between 1850 and 1920). Here is the code I've managed to write so far:

rm(list=ls(all=T))

# set the working directory so that you can access the corpus files you want to process
setwd(choose.dir())
getwd() # "F:/My files/Corpora/English/clmet3_0/plain_text/1850-1920"

# write a function that can create n-grams
word.ngrams <- function(input.vector, gram.length) {
   output <- apply(                                 # use the matrix that
      mapply(                                       # you get from using mapply
         seq,                                       # to generate sequences that
         1:(length(input.vector)-(gram.length-1)),  # begin at 1, 2, ...
         gram.length:length(input.vector)           # and end with the relevant gram lengths
      ),                                            # use the values from that matrix
      2,                                            # in a columnwise fashion
      function(items) {                             # an inline/anonymous function that
         paste(input.vector[items],                 # accesses and then pastes together the subsetted words
               collapse=" ")                        # with spaces between them
      }                                             # end of inline/anonymous function
   )                                                # end of apply(...)
   return(output)                                   # return the output object just created
} # end of function definition
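
# a quick sanity check of the function on a toy vector (just an illustration,
# not part of the corpus workflow itself):
word.ngrams(c("the", "cat", "sat", "on", "the", "mat"), 3)
# [1] "the cat sat" "cat sat on"  "sat on the"  "on the mat"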


# select the corpus files
corpus.files <- list.files(path = getwd(), pattern = "\\.txt$", full.names = TRUE) # same directory as above
head(corpus.files)   # shows the full path to the first six corpus files
length(corpus.files)


# generate a table of trigrams in a for loop ...
for (i in length(corpus.files)) { # enter the for loop
   # load the current corpus file
   current.corpus.file <- tolower(scan(corpus.files[i],    # load the i-th file into current.corpus.file
                                       what = character(), # as a character vector
                                       sep = "\n",         # with linebreaks as separators between vector elements
                                       quote = "", comment.char = "",
                                       quiet = TRUE))      # and convert the file to lower case
   clean.current.corpus.file <- gsub(".*?<.+?>", "", current.corpus.file, ignore.case = TRUE) # remove the metadata from the file
   # create a vector of all words in the corpus
   textfile.words <- strsplit(clean.current.corpus.file, # split up the vector clean.current.corpus.file
                              "[^a-z]+",                 # at 1+ occurrences of characters that are not letters a-z
                              perl = TRUE)               # using Perl-compatible regular expressions
   textfile.words <- unlist(textfile.words)              # change the list into a vector
   # remove empty character strings
   textfile.words <- textfile.words[nzchar(textfile.words)]
   trigrams <- word.ngrams(textfile.words, 3)
   current.trigrams <- paste(basename(corpus.files[i]), trigrams, sep = "\t") # prefix the name of the corpus file to each trigram and store the results in a vector
   write.table(current.trigrams, file = choose.files(), quote = FALSE, sep = "\t", row.names = FALSE, append = TRUE) # output the results in a tab-delimited file
}

When I run the code, what I get is a table with 242,108 trigrams, but the file ID remains the same throughout, namely the ID of the last file in the corpus (CLMET3_0_3_333.txt), and the list of trigrams corresponds to the trigrams found in this file only:

CLMET3_0_3_333.txt punch or the
CLMET3_0_3_333.txt or the london
CLMET3_0_3_333.txt the london charivari
CLMET3_0_3_333.txt london charivari vol
CLMET3_0_3_333.txt charivari vol august
CLMET3_0_3_333.txt vol august pg
CLMET3_0_3_333.txt august pg modern
CLMET3_0_3_333.txt pg modern types
CLMET3_0_3_333.txt modern types by
CLMET3_0_3_333.txt types by mr
...

The problem can certainly be solved by changing the way the data are stacked, but I just don't know how to proceed. Any help would be much appreciated.

I also plan to append columns with metadata (genre, year, sex, etc.) at a later stage. I have attached three corpus files to this mail so that anyone interested can reproduce the example.

Thanking you in advance,

C. B.
CLMET3_0_3_188.txt.txt
CLMET3_0_3_260.txt.txt
CLMET3_0_3_333.txt.txt

Stefan Th. Gries
Apr 11, 2018, 8:41:03 PM
to CorpLing with R
When you do

for (i in length(corpus.files)) {

i is only ever 3 (because you use length), never 1:3 (because you
don't use seq), and you don't see this because you have no progress
report in the loop. The following works:

###########################################
all.trigrams <- c()
for (i in seq(corpus.files)) { cat(i, "\n")
   current.corpus.file <- tolower(scan(corpus.files[i],
      what=character(), sep="\n", quote="", comment.char="",
      fileEncoding="ISO-8859-1", quiet=TRUE))
   clean.current.corpus.file <- gsub(".*?<.+?>", "", current.corpus.file)
   textfile.words <- unlist(strsplit(clean.current.corpus.file,
      "[^a-z]+", perl=TRUE))
   textfile.words <- textfile.words[nzchar(textfile.words)]
   trigrams <- word.ngrams(textfile.words, 3)
   all.trigrams <- c(all.trigrams, paste(basename(corpus.files[i]),
      trigrams, sep="\t"))
}
all.trigrams[1:3]
all.trigrams[142162:142164]
all.trigrams[162582:162584]
###########################################
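
To see the difference the two index expressions make, here is a quick illustration with a three-element stand-in for corpus.files:

files <- c("a.txt", "b.txt", "c.txt")  # stand-in for corpus.files
length(files)  # 3      -> for (i in length(files)) runs the body once, with i = 3
seq(files)     # 1 2 3  -> for (i in seq(files)) runs the body once per file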

Best,
STG

Christophe Bechet
Apr 12, 2018, 8:24:44 AM
to CorpLing with R
Thank you for the quick reply, Stefan. It was quite late when I tried the code and I didn't pay attention when specifying how each corpus file had to be processed.

Jerid Francom
Apr 13, 2018, 10:19:59 AM
to CorpLing with R
Hi Christophe,

It seems you've got good feedback here, but I wanted to suggest a different approach to keeping your meta-data (file names in this case) and tokens aligned. Your desired output is precisely the aim of the 'tidy' approach to R. 

In this approach I assume the following directory structure. 

.
├── create_trigrams.R
└── data
    ├── derived
    │   └── trigrams.csv
    └── original
        ├── CLMET3_0_3_188.txt.txt
        ├── CLMET3_0_3_260.txt.txt
        └── CLMET3_0_3_333.txt.txt

The code in `create_trigrams.R` is the following: 

First, load the required packages:

# SETUP -------------------------------------------------------------------
library(tidyverse) # for data manipulation and piping (%>%)
library(tidytext)  # for tokenization
library(readtext)  # for reading text and meta-data


Then read the data with `readtext`:

# Read data ---------------------------------------------------------------
text <-
  readtext(file = "data/original/*",    # read all files
           encoding = "ISO-8859-1") %>% # set the file encoding
  as_tibble()                           # coerce into tibble data.frame format


Next, clean up the data, removing the inline meta-data:

# Clean data --------------------------------------------------------------
clean_text <-
  text %>%                                            # pass the `text` data.frame
  unnest_tokens(lines, text, token = "lines") %>%     # tokenize by lines
  filter(!str_detect(lines, "^<")) %>%                # remove meta-data
  group_by(doc_id) %>%                                # group by `doc_id`
  summarise(text = str_c(lines, collapse = " ")) %>%  # merge `lines`
  mutate(text = str_trim(text))                       # remove leading/trailing whitespace


Create the trigrams:

# Tokenize ----------------------------------------------------------------
trigrams <-
  clean_text %>%
  unnest_tokens(trigrams, text, token = "ngrams", n = 3)


`trigrams` now contains your desired format:

# A tibble: 400,073 x 2
   doc_id                 trigrams
   <chr>                  <chr>
 1 CLMET3_0_3_188.txt.txt the four chapters
 2 CLMET3_0_3_188.txt.txt four chapters of
 3 CLMET3_0_3_188.txt.txt chapters of which
 4 CLMET3_0_3_188.txt.txt of which this
 5 CLMET3_0_3_188.txt.txt which this work
 6 CLMET3_0_3_188.txt.txt this work consists
 7 CLMET3_0_3_188.txt.txt work consists originally
 8 CLMET3_0_3_188.txt.txt consists originally appeared
 9 CLMET3_0_3_188.txt.txt originally appeared as
10 CLMET3_0_3_188.txt.txt appeared as four


Finally, write this data to `trigrams.csv`:

# Write data --------------------------------------------------------------
write_csv(trigrams, path = "data/derived/trigrams.csv")
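
Since you plan to append genre/year/sex columns later, that also fits neatly into this format: assuming you build a `metadata` tibble with one row per `doc_id` (the table and its column names here are only placeholders), it can be joined onto the trigram table:

# Join meta-data (sketch; `metadata` and its columns are assumed) ----------
trigrams_meta <-
  trigrams %>%
  left_join(metadata, by = "doc_id") # adds e.g. genre, year, sex columns per document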


As a side note, if you are gathering data from Project Gutenberg, I suggest taking a look at the `gutenbergr` package for automating the process. 
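
For example, a minimal sketch (the numeric ID below is only a placeholder for whichever Gutenberg work you want):

library(gutenbergr)
gb_text <- gutenberg_download(11) # one work by its Project Gutenberg ID; returns a tibble with `gutenberg_id` and `text` columns
head(gb_text$text)                # first few lines of the downloaded text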

Best,

J

Christophe Bechet
Apr 13, 2018, 10:32:02 AM
to corplin...@googlegroups.com
Thanks, Jerid! I tested gutenbergr some time ago and I'm not yet familiar with the tidyverse. Maybe I'll catch up with this later!

Best,

C. B.
