HTML Stripper not working

Brian Ó Broin

Sep 17, 2021, 1:17:25 PM
to nltk-users
This question is based on chapter 4 of the NLTK book online: "Writing Structured Programs", section 4.4 "Functions".
The following simple function should strip the HTML markup from any file passed to it, but the text it returns for my files still has the markup intact. Any suggestions about what's going wrong?

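import re

def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    with open(file) as f:  # close the file once it has been read
        text = f.read()
    text = re.sub(r'<.*?>', ' ', text)  # replace each tag with a space
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text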
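
For what it's worth, here is a minimal sketch for checking the two substitutions on an in-memory string, independent of any file. The name strip_html and the sample string are made up for illustration, and the re.DOTALL flag is an addition that is not in the book's version; it is there on the assumption that some tags in the files span more than one line, which '.' would otherwise not match.

```python
import re

def strip_html(text):
    """Replace HTML tags with spaces, then collapse runs of whitespace."""
    # re.DOTALL lets '.' match newlines, so a tag split across lines is removed too.
    text = re.sub(r'<.*?>', ' ', text, flags=re.DOTALL)
    text = re.sub(r'\s+', ' ', text)
    return text

# Hypothetical sample: the <a ...> tag is deliberately split across two lines.
sample = '<p>Hello,\n<a\nhref="x">world</a>!</p>'

# Strings are immutable, so the cleaned text is the return value;
# the original string is left untouched.
cleaned = strip_html(sample)
print(cleaned)
```

If the markup disappears here but not with the real files, the difference probably lies in how the function is called or in the files themselves rather than in the regular expressions.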