This question is based on chapter 4 of the NLTK book online: "Writing Structured Programs", section 4.4 "Functions".
The following simple function is supposed to strip HTML markup from any file passed to it, but the text I get back for my files still has the HTML tags intact. Any suggestions about what's going wrong?
    import re

    def get_text(file):
        """Read text from a file, normalizing whitespace and stripping HTML markup."""
        text = open(file).read()
        text = re.sub(r'<.*?>', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        return text
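
To narrow this down, it may help to run the same two substitutions on in-memory strings instead of files. The following is only a minimal sketch for testing, not a diagnosis: `strip_markup`, its `flags` parameter, and the sample strings are hypothetical names and data introduced here, not part of the NLTK book. It checks one possible cause, namely that `.` does not match a newline by default, so a tag that is split across lines would survive the first substitution unless `re.DOTALL` is passed.

```python
import re

def strip_markup(text, flags=0):
    """Apply the same two substitutions as get_text, but to a string.

    `flags` is a hypothetical extra parameter, added only so the effect
    of re.DOTALL can be compared against the default behaviour.
    """
    text = re.sub(r'<.*?>', ' ', text, flags=flags)
    return re.sub(r'\s+', ' ', text)

# Made-up sample inputs (assumptions, not taken from the real files).
single_line = '<p>Hello <b>world</b></p>'
multi_line = '<a\nhref="x.html">link</a>'   # opening tag spans two lines

print(repr(strip_markup(single_line)))                  # tags on one line are removed
print(repr(strip_markup(multi_line)))                   # the split <a ...> tag is left in place
print(repr(strip_markup(multi_line, flags=re.DOTALL)))  # the split tag is removed
```

If the single-line case comes back clean here, the regular expression itself is probably not the problem for ordinary tags. It would then be worth checking whether the real files contain tags broken across lines, and whether the calling code actually uses the string returned by `get_text(...)` rather than re-reading the original file.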