I am trying to use BeautifulSoup to strip all the markup from HTML files and get the plain text as it would appear in a browser.
If I try the following:
from bs4 import BeautifulSoup
markup = "<html><body>This is\nsome text.</body></html>"
print(BeautifulSoup(markup, "html.parser").get_text())
The output I get is:
This is
some text.
The output I expected is:
This is some text.
I tried using `html5lib` instead of `html.parser`, but it gave me the same result.
Of course, `This is some text.` is how such an HTML file would be displayed in a browser, since a browser collapses a newline inside ordinary text into a single space.
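To make the behavior I am after concrete, here is a minimal sketch of the whitespace collapsing as a post-processing step on the extracted text. `collapse_whitespace` is a hypothetical helper I made up for illustration, not part of Beautiful Soup, and it assumes every run of whitespace can be collapsed, which would be wrong for content such as `<pre>` blocks:

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space
    and trim the ends. Hypothetical helper, not a Beautiful Soup function."""
    return re.sub(r"\s+", " ", text).strip()


# The string that get_text() returns for the markup in the example above.
extracted = "This is\nsome text."
print(collapse_whitespace(extracted))  # prints: This is some text.
```

This works on the flat string only, so it knows nothing about block-level elements or preformatted text, which is why I would prefer something that understands the document structure.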
Is it possible to do what I am trying with Beautiful Soup? If not, is there some other tool that would give me what I'm looking for?