HTML Stripper not working

Brian Ó Broin

Sep 17, 2021, 1:17:25 PM

to nltk-users

This question is based on chapter 4 of the NLTK book online: "Writing Structured Programs", section 4.4 "Functions".

The following simple function should strip HTML markup from any file passed to it, but the files I pass to it come back with the HTML intact. Any suggestions about what's going wrong?

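import re

def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    # Read the whole file, closing it when done
    with open(file) as f:
        text = f.read()
    # Replace each tag with a space (non-greedy match)
    text = re.sub(r'<.*?>', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text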
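
For anyone reading along, here is a minimal sketch of two things that could produce this symptom. Both are assumptions, since the post does not show the input files or how the function is called. First, '.' does not match a newline by default, so a tag that spans more than one line is left untouched unless re.DOTALL is passed. Second, the function returns a new string and does not change the file on disk, so the caller has to assign or print the result. The names strip_html and sample are hypothetical and are not from the NLTK book.

```python
import re

def strip_html(text):
    """Strip HTML markup from a string and normalize whitespace (hypothetical helper)."""
    # re.DOTALL lets '.' match newlines, so tags split across lines are removed too.
    text = re.sub(r'<.*?>', ' ', text, flags=re.DOTALL)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical input: the opening <p ...> tag spans two lines.
sample = '<p class="intro"\n   id="first">Hello <b>world</b></p>'

# Without DOTALL, the two-line <p ...> tag survives the substitution.
print(re.sub(r'<.*?>', ' ', sample))

# The result is a new string; it has to be assigned (or printed) by the caller.
clean = strip_html(sample)
print(clean)
```

If neither of these matches the situation, it would help to see a short sample of the input file and the exact call being made. A regular expression is also only a rough approach to HTML: it will not skip the contents of script or style elements, for example.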