Extracting body text as it would appear in a browser


Porter Bassett

Sep 20, 2016, 11:11:55 AM
to beautifulsoup
I am trying to use Beautiful Soup to strip all the markup from HTML files and give me the plain text as it would be seen in a browser.

If I try the following:

    from bs4 import BeautifulSoup
    markup = "<html><body>This is\nsome text.</body></html>"
    print(BeautifulSoup(markup, "html.parser").get_text())


The output I get is:

    This is
    some text.


The output I expected is:

    This is some text.

I tried using `html5lib` instead of `html.parser`, but it gave me the same results.

Of course, `This is some text.` is how such an HTML file would be displayed in a browser.

Is it possible to do this with Beautiful Soup? If not, is there some other tool that would give me what I'm looking for?

Bill Thompson

Dec 20, 2016, 8:05:45 AM
to beautifulsoup
All you need to do is replace each '\n' with a space before parsing. You can do it like this in Python:

    from bs4 import BeautifulSoup
    markup = "<html><body>This is\nsome text.</body></html>"
    markup = markup.replace('\n', ' ')
    print(BeautifulSoup(markup, "html.parser").get_text())

This will give you: 

    This is some text.

I'm using Python 3.4.
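
A possible variation, as a rough sketch: leave the markup untouched and normalize whitespace after extraction instead, which is closer to how a browser collapses runs of spaces, tabs, and newlines in ordinary text. The helper name `collapse_whitespace` is made up for this example, and the sketch assumes the document has no `<pre>` blocks or block-level structure that should survive.

```python
from bs4 import BeautifulSoup


def collapse_whitespace(markup):
    """Extract the text and collapse every run of whitespace to one space.

    Hypothetical helper, not part of Beautiful Soup.
    """
    text = BeautifulSoup(markup, "html.parser").get_text()
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


markup = "<html><body>This is\nsome   text.</body></html>"
print(collapse_whitespace(markup))  # This is some text.
```

Because the newline is handled after parsing, this also covers tabs and repeated spaces, which a plain `replace('\n', ' ')` on the markup would leave in place.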

Richard C

Aug 9, 2017, 11:21:02 AM
to beautifulsoup
I had the same requirement when converting HTML email messages to fallback plain-text messages while roughly retaining the formatting.

I wrote this based on a Stack Exchange post (I think), but it would be really awesome if something like this were built into bs4. I guess it's akin to rendering an HTML page in a text-mode browser.
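
For illustration only, and not the code referred to above: a minimal sketch of one way to keep rough formatting, assuming it is enough to collapse whitespace inside text nodes and start a new line at block-level tags and `<br>`. The `BLOCK_TAGS` set and the `html_to_text` name are hypothetical choices for this example, not anything provided by bs4.

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a small, incomplete set of tags that should start a new line.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


def html_to_text(markup):
    """Rough text rendering: one output line per block-level element."""
    soup = BeautifulSoup(markup, "html.parser")
    pieces = []

    def walk(node):
        for child in node.children:
            # Exact type check so comments, doctypes, etc. are skipped.
            if type(child) is NavigableString:
                pieces.append(re.sub(r"\s+", " ", str(child)))
            elif isinstance(child, Tag):
                if child.name in ("script", "style"):
                    continue  # never visible in a browser
                is_block = child.name in BLOCK_TAGS
                if is_block:
                    pieces.append("\n")
                walk(child)
                if is_block:
                    pieces.append("\n")

    walk(soup)
    # Tidy each line and drop the empty ones left by nested blocks.
    lines = (re.sub(r" +", " ", line).strip() for line in "".join(pieces).split("\n"))
    return "\n".join(line for line in lines if line)


email = "<div><p>Hello\n  world,</p><p>this is <b>bold</b>.</p><br>Bye</div>"
print(html_to_text(email))
```

With the sample markup this prints three lines: `Hello world,`, `this is bold.`, and `Bye`. It makes no attempt at lists, tables, links, or `<pre>`, which a text-mode browser would handle.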