PlaintextCorpusReader for URLs

ac

Jan 6, 2015, 2:09:11 PM
to nltk-...@googlegroups.com
Hello All
 
I would like to load my own corpus, which consists of a set of HTML documents, and apply the basic NLTK corpus functionality to it (fileids(), etc.). I have seen that PlaintextCorpusReader does this for plain text files stored under a root directory. Is there any function that (i) accesses HTML documents directly, (ii) transforms them into a format that the NLTK corpus functionality can be applied to, and (iii) does not require downloading the files to disk as an intermediate step?
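
To make the request concrete, here is a rough sketch of the behaviour I have in mind, written with the Python standard library only. The names html_to_text and InMemoryHTMLCorpus are made up for illustration and are not part of NLTK; the class only imitates a few reader methods (fileids(), raw(), words()) and uses a crude regular-expression tokenizer, so please read it as a description of the goal rather than a proposed solution.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Simplification: chunks are joined with a space, so a word split
        # by an inline tag (e.g. "<b>H</b>ello") would come out as two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: strip tags and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


class InMemoryHTMLCorpus:
    """Hypothetical class (not an NLTK reader) that keeps everything in memory."""

    def __init__(self, documents):
        # documents: mapping of file id (for example a URL) -> HTML string
        self._texts = {fid: html_to_text(html) for fid, html in documents.items()}

    @classmethod
    def from_urls(cls, urls):
        # Fetch each page straight into memory; nothing is written to disk.
        # Assumes the pages are UTF-8; undecodable bytes are replaced.
        docs = {}
        for url in urls:
            with urlopen(url) as response:
                docs[url] = response.read().decode("utf-8", errors="replace")
        return cls(docs)

    def fileids(self):
        return sorted(self._texts)

    def raw(self, fileid):
        return self._texts[fileid]

    def words(self, fileid):
        # Crude tokenizer: runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", self._texts[fileid])
```

With something like this, InMemoryHTMLCorpus.from_urls([...]) would take a list of page addresses, and corpus.words(fileid) would return the tokens for one page. The regular-expression tokenizer could presumably be swapped for nltk.word_tokenize, which is what I would ideally like to use.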
 
Apologies in advance if the question is silly, but I am new to both Python and NLTK!
 
Thank you very much 