Speeding up XML corpus access

Dimitriadis, A. (Alexis)

Jun 1, 2015, 2:13:25 PM
to <nltk-users@googlegroups.com>
I am running some searches on the XML version of the British National Corpus (not distributed with the NLTK), using the NLTK's BNCCorpusReader. It works fine, but it's SLOW. Simply counting the words in the first 30 files (about 950 thousand words) takes 21 seconds on my computer. By comparison, counting the words in the Brown corpus (500 plain text files, one million words) takes less than three seconds.
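
A minimal sketch of how a timing like the one above might be taken. The helper name `time_word_count` is hypothetical, and it only assumes a reader object with a `words(fileids)` method, as NLTK corpus readers have. The corpus path in the usage comment is a placeholder, not the actual location used for these numbers.

```python
import time


def time_word_count(reader, fileids):
    """Count the tokens yielded by reader.words(fileids) and time the pass.

    `reader` is any object with a words(fileids) method (an assumption that
    holds for NLTK corpus readers). Returns (token_count, elapsed_seconds).
    """
    start = time.perf_counter()
    # Iterate rather than call len(), so plain iterables work as well.
    n_words = sum(1 for _ in reader.words(fileids))
    elapsed = time.perf_counter() - start
    return n_words, elapsed


# Usage sketch (not run here; needs NLTK and a local copy of the BNC):
#
#   from nltk.corpus.reader.bnc import BNCCorpusReader
#   bnc = BNCCorpusReader(root="BNC/Texts",                  # placeholder path
#                         fileids=r"[A-K]/\w*/\w*\.xml")
#   print(time_word_count(bnc, bnc.fileids()[:30]))
```

The same helper can be pointed at `nltk.corpus.brown` to get the comparison figure.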

The problem is clearly that the BNC reader has to parse a whole bunch of XML files. So the question is: is there any provision for speeding up access, for example by hooking up the BNCCorpusReader to a (faster) third-party XML library?
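
To make the question concrete, here is a rough sketch of what bypassing the reader could look like for a plain count: streaming each file with the standard library's `xml.etree.ElementTree.iterparse`. It is not a feature of the BNCCorpusReader. It assumes that word tokens are stored in `<w>` elements without an XML namespace, and the function name `count_word_elements` is made up for illustration. The result may not match what `BNCCorpusReader.words()` reports, since the reader may tokenize differently (for example, in how it treats punctuation).

```python
import xml.etree.ElementTree as ET


def count_word_elements(paths, tag="w"):
    """Count elements named `tag` across the given XML files.

    Streams each file with iterparse instead of building a full tree.
    Assumes (not verified here) that every word token is one <w> element.
    """
    total = 0
    for path in paths:
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == tag:
                total += 1
            # Drop the finished element's text and children to limit memory.
            elem.clear()
    return total
```

A third-party parser such as lxml offers a similar `iterparse` interface, so the same loop could be tried there to see whether the parser itself is the bottleneck.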

Alexis



Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
