Speeding up XML corpus access

Dimitriadis, A. (Alexis)

Jun 1, 2015, 2:13:25 PM

to <nltk-users@googlegroups.com>

I am doing some searches on the XML version of the British National Corpus (not distributed with the NLTK), using the NLTK's BNCCorpusReader. It works fine, but it's SLOW. Simply counting the words in the first 30 files (about 950 thousand words) takes 21 seconds on my computer. By comparison, counting the words in the Brown corpus (500 plain text files, one million words) takes less than three seconds.
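
For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. The helper name `timed_count` is made up for illustration; it takes any zero-argument callable that returns an iterable of words, so a corpus reader's `words()` call could be wrapped in a lambda. The sample list below only stands in for a real corpus.

```python
import time


def timed_count(get_words):
    """Count the items produced by get_words() and time how long it takes.

    get_words is a zero-argument callable returning an iterable of words
    (hypothetical helper, not part of the NLTK).
    Returns a (count, elapsed_seconds) tuple.
    """
    start = time.perf_counter()
    count = sum(1 for _ in get_words())
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in data; with a real reader one would pass something like
    # lambda: reader.words(fileids=some_files) instead.
    sample = "the quick brown fox".split()
    n, secs = timed_count(lambda: iter(sample))
    print(f"{n} words in {secs:.6f} s")
```

Iterating rather than calling `len()` forces every word to be read, which is what the timing is meant to measure.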

The problem is clearly that the BNC reader has to parse a whole bunch of XML files. So the question is: is there any provision for speeding up access, for example by hooking up the BNCCorpusReader to a (faster) third-party XML library?
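
To make the idea concrete, here is a rough sketch of counting words by streaming over the XML directly with the standard library's `xml.etree.ElementTree.iterparse`, bypassing the corpus reader. It assumes that word tokens in the BNC XML edition are `<w>` elements and punctuation is `<c>`; the function name `count_w_elements` and the inline sample document are invented for illustration. Whether this is actually faster than BNCCorpusReader has not been measured here.

```python
import io
import xml.etree.ElementTree as ET


def count_w_elements(source):
    """Count <w> elements in one XML document, streaming through it.

    source may be a filename or a binary file-like object.
    Assumption: word tokens are marked up as <w> elements.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "w":
            count += 1
        # Drop the element's text and children once it has been seen,
        # so memory use stays low on large files.
        elem.clear()
    return count


# Tiny made-up document in the assumed BNC-like shape.
SAMPLE = (
    b'<bncDoc><wtext><div><p><s n="1">'
    b'<w c5="AT0" hw="the" pos="ART">The </w>'
    b'<w c5="NN1" hw="cat" pos="SUBST">cat </w>'
    b'<w c5="VVD" hw="sit" pos="VERB">sat</w>'
    b'<c c5="PUN">.</c>'
    b"</s></p></div></wtext></bncDoc>"
)

if __name__ == "__main__":
    print(count_w_elements(io.BytesIO(SAMPLE)))
```

For a set of files, one would sum `count_w_elements(path)` over the paths. The third-party lxml library offers an `iterparse` with a similar interface, so the same loop could serve as a starting point for trying a different parser.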

Alexis