PlaintextCorpusReader for URLs

ac

Jan 6, 2015, 2:09:11 PM

to nltk-...@googlegroups.com

Hello all,

I would like to load my own corpus, which consists of a set of HTML documents, and apply the basic NLTK corpus functionality to it (fileids(), etc.). I have seen that PlaintextCorpusReader does this for plain-text files stored under a root directory. Is there a function that (i) accesses HTML documents directly, (ii) transforms them into a format that the NLTK corpus functionality can be applied to, and (iii) does not require saving the files to disk as an intermediate step?

Apologies in advance if the question is silly, but I am new to both Python and NLTK!
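
To make the request concrete, here is a rough sketch of the behaviour described in (i)-(iii). It uses only the Python standard library and is not an NLTK function. Every name in it (html_to_text, fetch_html, InMemoryHtmlCorpus) is hypothetical and made up for illustration, and the tag stripping is deliberately crude: it simply drops tags and the contents of script and style elements.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def fetch_html(url):
    """Download a page into memory; nothing is written to disk."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class InMemoryHtmlCorpus:
    """Hypothetical stand-in that only mimics fileids(), raw() and words()."""

    def __init__(self, urls, fetch=fetch_html):
        # Each URL doubles as the file id; the text lives only in memory.
        self._texts = {url: html_to_text(fetch(url)) for url in urls}

    def fileids(self):
        return sorted(self._texts)

    def raw(self, fileid):
        return self._texts[fileid]

    def words(self, fileid):
        # Simple split into runs of word characters and runs of punctuation.
        return re.findall(r"\w+|[^\w\s]+", self._texts[fileid])
```

With a list of real URLs this would be called as InMemoryHtmlCorpus(urls). The fetch argument exists so that the sketch can also be tried on HTML strings already held in memory, without any network access.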