How to speed up retrieval from the BNC?

Christophe Bechet

Apr 13, 2018, 5:07:06 PM

to CorpLing with R

Hi,

I'm reproducing an example from QCLWR (1st edition), namely "A Frequency List of Tag-word sequences from an Annotated Corpus" (pp. 114-118), but instead of processing the two files suggested in the book, I'm trying to run the code on the entire BNC (4049 corpus files, 696 kb). Here is what I did.

library(gsubfn)

# choose the directory from which the corpus files are to be loaded
setwd(choose.dir()) # "F:/Corpora/English/2554/2554/download/Texts"

# load the corpus files and iterate the procedure over all the sub-directories
corpus.files <- list.files(getwd(), recursive=TRUE, full.names=TRUE, pattern="\\.xml$") # 4049 elements
head(corpus.files); tail(corpus.files)

# reserve a data structure for the whole corpus
whole.corpus<-vector()

# load each corpus file into a vector called current.corpus.file and change all corpus lines into lower case
for(i in corpus.files) {
  current.corpus.file<-tolower(scan(i, what="char", sep="\n", quiet=T))
  # cat(basename(i), "\n") # output a 'progress report'
  current.sentences<-grep("<s n=", current.corpus.file, perl=T, value=T) # tell R not to include the header, utterance tags, etc. in our counts
  current.sentences<-sub("<s n=.*?>", "", current.sentences, perl=T) # tell R not to use the sentence number in our counts
  whole.corpus<-append(whole.corpus, current.sentences) # append the results of these operations to the vector whole.corpus
}

It takes R quite a long time to process all the files and retrieve the sentence lines, which is not surprising given the amount of data involved. However, after 3 hours the retrieval has still not finished. Is there a way to speed the whole thing up?
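One likely contributor to the slowdown in a loop like the one above is that `append()` copies the ever-growing vector on every iteration. A minimal sketch of a common alternative, assuming the same sentence-extraction logic: fill one pre-allocated list slot per file and flatten once at the end. The helper `extract.sentences` and the toy files are made up for illustration and are not from the book; how much time this saves on the full BNC has not been measured here.

```r
# Hypothetical helper (not from QCLWR): extract the sentence lines of one file.
extract.sentences <- function(path) {
  lines <- tolower(readLines(path, warn = FALSE))
  # keep only lines containing a sentence tag; fixed = TRUE avoids regex overhead
  sentences <- grep("<s n=", lines, fixed = TRUE, value = TRUE)
  # strip the opening sentence tag, including its number
  sub("<s n=[^>]*>", "", sentences, perl = TRUE)
}

# Toy stand-ins for corpus files so that the sketch runs anywhere.
demo.dir <- tempfile("bnc_demo_")
dir.create(demo.dir)
writeLines(c("<teiHeader>ignored</teiHeader>",
             "<s n=\"1\"><w c5=\"AT0\">The </w><w c5=\"NN1\">cat</w></s>"),
           file.path(demo.dir, "a.xml"))
writeLines("<s n=\"1\"><w c5=\"AT0\">A </w><w c5=\"NN1\">dog</w></s>",
           file.path(demo.dir, "b.xml"))
demo.files <- list.files(demo.dir, pattern = "\\.xml$", full.names = TRUE)

# One list slot per file, filled in place; the big vector is built only once.
per.file <- vector("list", length(demo.files))
for (i in seq_along(demo.files)) {
  per.file[[i]] <- extract.sentences(demo.files[i])
}
whole.corpus <- unlist(per.file, use.names = FALSE)
```

With real data, `demo.files` would be replaced by `corpus.files`. This still keeps every sentence in memory, so it only addresses the copying cost, not the memory footprint.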

Stefan Th. Gries

Apr 13, 2018, 5:52:04 PM

to CorpLing with R

Why are you saving all sentences when you want to create a frequency list? That's not necessary and not efficient! And check out QCLWR2 ;-)

STG
--
Stefan Th. Gries
----------------------------------
Univ. of California, Santa Barbara
http://tinyurl.com/stgries
----------------------------------

Christophe Bechet

Apr 13, 2018, 6:52:34 PM

to CorpLing with R

I've got QCLWR2 in the Kindle version, but I haven't gone through all the chapters yet. Well, the idea is not really to create a frequency list, but the basics are the same. I just need to produce a list of n-grams from BNC data.

Best,

C. B.

Stefan Th. Gries

Apr 13, 2018, 6:56:34 PM

to CorpLing with R

Well, the second edition provides a few examples that are closer in size to what you have in mind, i.e. using the whole BNC or something as big as that. It is important, though, to give examples that resemble what you want to do. If you want to generate anything having to do with frequencies, it is more prudent to save interim results in the form of a frequency table or something similar, where each word is stored only once together with its frequency, as opposed to storing the same word multiple times because you're keeping whole sentences in memory. For those kinds of questions, this is important.
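A minimal sketch of that idea, not taken from either edition of the book: count tag-word pairs per file and fold them into one running frequency vector, so each pair is stored once with its count. The helpers `count.tag.words` and `add.counts` and the toy input are hypothetical, and the regular expressions assume lower-cased BNC XML-style `<w c5="...">` elements.

```r
# Hypothetical helper: tabulate tag_word pairs in a vector of sentence lines.
count.tag.words <- function(sentences) {
  hits <- unlist(regmatches(
    sentences,
    gregexpr("<w c5=\"[^\"]+\"[^>]*>[^<]*", sentences, perl = TRUE)))
  tags  <- sub("<w c5=\"([^\"]+)\".*", "\\1", hits, perl = TRUE)
  words <- trimws(sub("<w [^>]*>", "", hits, perl = TRUE))
  table(paste(tags, words, sep = "_"))
}

# Hypothetical helper: merge one file's counts into the running totals.
add.counts <- function(total, new) {
  if (length(new) == 0L) return(total)
  keys  <- names(new)
  known <- keys %in% names(total)
  # increment pairs seen before, then append pairs seen for the first time
  total[keys[known]] <- total[keys[known]] + as.integer(new[known])
  c(total, setNames(as.integer(new[!known]), keys[!known]))
}

# Toy stand-ins for the sentence lines of two corpus files.
toy.files <- list(
  "<w c5=\"at0\">the </w><w c5=\"nn1\">cat </w><w c5=\"vvd\">sat</w>",
  "<w c5=\"at0\">the </w><w c5=\"nn1\">dog </w><w c5=\"vvd\">sat</w>"
)

freqs <- integer(0)
for (current.sentences in toy.files) {
  # only the counts survive each iteration; the sentences are discarded
  freqs <- add.counts(freqs, count.tag.words(current.sentences))
}
print(sort(freqs, decreasing = TRUE))
```

The same pattern would apply to n-grams: build the n-grams for one file, tabulate them, merge, and move on. For a corpus the size of the BNC, a hashed structure such as an environment may merge faster than name lookup in a vector; that is an untested assumption here.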

Best,

STG

Christophe Bechet

Apr 13, 2018, 6:58:14 PM

to CorpLing with R

Now I see. Section 5.2.8 of QCLWR2 is devoted to Word-Tag combinations and the issue of memory :-D

On Friday, April 13, 2018 at 23:52:04 UTC+2, Stefan Th. Gries wrote:

Stefan Th. Gries

Apr 13, 2018, 7:06:47 PM

to CorpLing with R

Yup :-))

Christophe Bechet

Apr 14, 2018, 5:36:43 AM

to CorpLing with R

The following piece of code doesn't work for me:

# define the directory where the BNC files are located and get the file names
corpus.files <- dir(rchoose.dir(), full.names = TRUE, recursive = TRUE) # or else 

# corpus.files <- dir(rchoose.dir(), full.names = TRUE, recursive = TRUE, pattern = "\\.xml$") # to get at the files with '.xml' extension only.

It may be due to the native hierarchy of the BNC directories, which resembles that of the BNC Baby (see attachment). The following works fine:

corpus.files <- list.files(getwd(), recursive=TRUE, full.names=TRUE, pattern="\\.xml$")

Any idea why the alternative with rchoose.dir does not work?

BNC_baby_hierarchy.png

Stefan Th. Gries

Apr 14, 2018, 8:44:39 AM

to CorpLing with R

What does that mean, "doesn't work?" Nothing happens? Error message? Your computer shuts down and packs its bag to go to Hawaii? You need to give us a little more to work with ... ;-)

Christophe Bechet

Apr 14, 2018, 12:51:59 PM

to CorpLing with R

It returned an empty character vector.

Stefan Th. Gries

Apr 14, 2018, 1:48:37 PM

to CorpLing with R

Works fine here (Linux system, with the BNC XML in its original folder
structure).
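When a file listing comes back empty on one machine but not another, it can help to check what path the dialog actually returned before blaming the folder structure. A minimal diagnostic sketch follows; `diagnose.listing` is a hypothetical helper and the demo folder is made up. It makes no claim about what `rchoose.dir()` returns on any particular system; it only inspects whatever value it is given.

```r
# Hypothetical helper: report why a recursive file listing might be empty.
diagnose.listing <- function(dir.path, pattern = "\\.xml$") {
  if (length(dir.path) != 1L || is.na(dir.path) || !nzchar(dir.path)) {
    return("no directory was selected (dialog cancelled or returned nothing)")
  }
  if (!dir.exists(dir.path)) {
    return(paste("not an existing directory:", dir.path))
  }
  all.files <- list.files(dir.path, recursive = TRUE, full.names = TRUE)
  xml.files <- grep(pattern, all.files, value = TRUE, ignore.case = TRUE)
  sprintf("%d files in total, %d matching '%s'",
          length(all.files), length(xml.files), pattern)
}

# Toy nested folder, loosely imitating a corpus with sub-directories.
demo.dir <- tempfile("bnc_demo_")
dir.create(file.path(demo.dir, "A", "A0"), recursive = TRUE)
writeLines("<s n=\"1\">x</s>", file.path(demo.dir, "A", "A0", "A00.xml"))
writeLines("notes", file.path(demo.dir, "readme.txt"))

print(diagnose.listing(demo.dir))                        # existing folder
print(diagnose.listing(file.path(demo.dir, "missing")))  # wrong path
print(diagnose.listing(character(0)))                    # nothing selected
```

In the situation described above, one would store the dialog result first, e.g. `chosen <- rchoose.dir()`, print it, and pass it to the helper.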
