Extracting body text as it would appear in a browser

Porter Bassett

Sep 20, 2016, 11:11:55 AM

to beautifulsoup

I am trying to use BeautifulSoup to strip all the markup from HTML files and give me the plain text as it would be seen in a browser.

If I try the following:

    from bs4 import BeautifulSoup
    markup = "<html><body>This is\nsome text.</body></html>"
    print(BeautifulSoup(markup, "html.parser").get_text())

The output I get is:

    This is
    some text.

The output I expected is:

    This is some text.

I tried using `html5lib` instead of `html.parser`, but it gave me the same results.

Of course, `This is some text.` is how such an HTML file would be displayed in a browser.

Is it possible to do what I am trying with Beautiful Soup? If not, is there some other tool that would give me what I'm looking for?

Bill Thompson

Dec 20, 2016, 8:05:45 AM

to beautifulsoup

All you need to do is replace each '\n' with a space. You can do it like this in Python:

    from bs4 import BeautifulSoup
    markup = "<html><body>This is\nsome text.</body></html>"

    # Replace newlines in the raw markup with spaces before parsing.
    markup = markup.replace('\n', ' ')

    print(BeautifulSoup(markup, "html.parser").get_text())

This will give you:

    This is some text.

I'm using Python 3.4.
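A browser using the default whitespace handling collapses every run of spaces, tabs, and line breaks into a single space, not only '\n'. A minimal sketch of that collapsing step, applied to the string that `get_text()` returns; the function name `collapse_whitespace` is made up for illustration:

```python
import re

# Hypothetical helper: collapse runs of ASCII whitespace the way a browser
# does for ordinary (non-<pre>) text. Non-breaking spaces (\xa0) are
# deliberately left alone, which is why str.split() is not used here.
_WS_RUN = re.compile(r"[ \t\r\n\f]+")


def collapse_whitespace(text):
    """Replace each whitespace run with one space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip()


# The string below is what get_text() returned in the original post, i.e.
# text = BeautifulSoup(markup, "html.parser").get_text()
text = "This is\nsome text."
print(collapse_whitespace(text))
```

Doing this on the extracted text rather than on the raw markup avoids touching whitespace inside tags or attributes. It does not treat `<pre>` content specially, and it does not insert line breaks between block elements.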

Richard C

Aug 9, 2017, 11:21:02 AM

to beautifulsoup

I had the same requirement when converting HTML email messages to fallback plain-text messages while roughly retaining the formatting.

I wrote this based on a Stack Exchange post (I think), but it would be really awesome if something like this were built into bs4. I guess it's akin to rendering an HTML page in a text-mode browser.
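The code referred to above is not included in the thread. As a rough illustration of the general idea only (not the poster's code), here is a minimal sketch that uses the standard library's `html.parser.HTMLParser` instead of bs4. The names `TextRenderer`, `html_to_text`, and the `BLOCK_TAGS` set are assumptions made for this example, and the tag list is far from complete:

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags for illustration.
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose text content is never displayed.
SKIP_TAGS = {"script", "style"}


class TextRenderer(HTMLParser):
    """Collects visible text, with a line break around block elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Line breaks inside text are just whitespace to a browser.
            self.parts.append(re.sub(r"[ \t\r\n\f]+", " ", data))


def html_to_text(markup):
    renderer = TextRenderer()
    renderer.feed(markup)
    renderer.close()
    lines = "".join(renderer.parts).split("\n")
    # Collapse leftover space runs, trim each line, and drop empty lines.
    cleaned = (re.sub(r" +", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)


print(html_to_text("<html><body>This is\nsome text.</body></html>"))
print(html_to_text("<p>Hello,\n  world</p><p>Second<br>line</p>"))
```

This keeps only a rough line structure: blank lines between paragraphs, list markers, tables, and `<pre>` blocks are not handled.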
