Dumping to PlainText

Joe

Mar 10, 2009, 11:36:57 PM
to beautifulsoup
I'm trying to just dump plaintext from websites. I was originally
trying to use this function: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#clean_html
but it was crashing.

After some googling, I found this:
http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_23692378.html

with this code:
import urllib2
from BeautifulSoup import BeautifulSoup as BSoup

page = urllib2.urlopen('http://somedomain/index.html').read()
soup = BSoup(page, convertEntities=BSoup.HTML_ENTITIES)

# only the text nodes between body tags.
print ''.join(soup.body(text=True))
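
For readers on Python 3, where urllib2 and the BeautifulSoup 3 import above are not available, the same idea (keep only the text nodes) can be roughly sketched with the standard library alone. This is an illustration, not the code from the linked answer: TextDumper and dump_text are hypothetical names, and html.parser is a simple tokenizer that may handle badly broken markup differently from BeautifulSoup.

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &rsquo; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def dump_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Unlike soup.body(text=True), this sketch does not restrict itself to the body element; it simply drops script and style content wherever it appears.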

Since BeautifulSoup seems to pride itself on being able to handle any
HTML, I wasn't expecting it to fail on the same URL. Here is the URL
that I'm trying to extract text from: http://www.amazon.com/Pig-Big-Douglas-Florian/dp/0688171265

Any help would be great,
Joe

Joe

Mar 11, 2009, 12:39:16 AM
to beautifulsoup
Alright, I read through the messages and saw that 3.0.7a was better,
but I've managed to crash that too. It is able to parse the Amazon
page from my first message, though.

This time it fails on this page:
http://www.engadget.com/2004/09/08/nokia-9300-continued-how-much-smaller-this-much/

I'm going to see if I can find a solution; maybe a different parser
would be better. If anyone knows of something that can strip the
plaintext out of an HTML document fairly reliably, I'd like to hear
about it.

Joe

Zulq Alam

Mar 11, 2009, 5:40:18 AM
to beauti...@googlegroups.com
Hi Joe,

I think there is a bug in the unicode conversion which only occurs when
convertEntities is specified. You can see this quickly by simply
removing this parameter.

>>> soup = BeautifulSoup(page)
>>> len(soup.prettify())
82127

The page we are talking about is not UTF-8 as it declares.

>>> page.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
7586: unexpected code byte

That byte is a Windows-1252 "smart quote" (one of the characters in the range \x80-\x9f):

>>> page[7580:7600]
' so it\x92s completely '
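
The same failure can be reproduced in isolation. The sketch below is written for Python 3 (the session above is Python 2) and uses only the 20-byte excerpt shown here rather than the real page, so the variable names are illustrative.

```python
# The excerpt from the page, as raw bytes. 0x92 is not valid UTF-8 here.
raw = b" so it\x92s completely "

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Same error class as in the traceback above.
    utf8_ok = False

# In Windows-1252, byte 0x92 is the right single quotation mark (U+2019).
text = raw.decode("windows-1252")
```

Decoding as Windows-1252 succeeds and yields a curly apostrophe, which is consistent with the page being mislabelled as UTF-8.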

BeautifulSoup will try a number of encodings before giving up. In this
case it tries UTF-8, ascii and windows-1252. Normally, when it tries
windows-1252 or similar encodings, it will convert these smart quotes
into a suitable entity reference, e.g. &rsquo;, before decoding. However,
whenever convertEntities is specified, smart quote conversion is never done:

if self.convertEntities:
    # It doesn't make sense to convert encoded characters to
    # entities even while you're converting entities to Unicode.
    # Just convert it all to Unicode.
    self.smartQuotesTo = None
# SNIP

Commenting out line 1063 (# self.smartQuotesTo = None) seems to resolve
the issue for me. However, I don't fully understand the ramifications of
this change.

Hope this helps.

- Zulq

Zulq Alam

Mar 11, 2009, 5:48:06 AM
to beauti...@googlegroups.com
Additionally, UnicodeDammit should probably have raised an error instead of returning None when all conversions have failed.
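
A minimal sketch of that behaviour, in Python 3: try each candidate encoding in turn and raise if none of them works, rather than returning None. The function name decode_with_fallback and the choice of ValueError are assumptions made for illustration; this is not UnicodeDammit's actual code.

```python
def decode_with_fallback(data, encodings=("utf-8", "ascii", "windows-1252")):
    """Return (text, encoding) for the first encoding that decodes data.

    Raises ValueError when every candidate fails, so the caller cannot
    mistake a failed conversion for an empty result.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    raise ValueError("none of %r could decode the input" % (encodings,))
```

Failing loudly here would have pointed at the encoding problem directly instead of surfacing later as an unrelated error on a None value.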

Jason Wang

Mar 11, 2009, 3:31:09 AM
to beautifulsoup
For some reason my reply didn't show up...

You should try html5lib. It's a drop-in replacement for the default
parser, and the rest is BeautifulSoup. Use it as follows:

from BeautifulSoup import BeautifulSoup
import urllib2
import html5lib
from html5lib import treebuilders

# URL is the address of the page to fetch
statement = urllib2.urlopen(URL)
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
soup = parser.parse(statement)

On Mar 11, 12:39 am, Joe <qbpro...@gmail.com> wrote:
> alright, I read through the messages and saw that 3.0.7a was better,
> but I've managed to crash that also.  It's able to parse the amazon
> page below though.
>
> this page this time:http://www.engadget.com/2004/09/08/nokia-9300-continued-how-much-smal...
>
> I'm going to see if I can find a solution, maybe a different parser
> would be better.  If anyone knows of something that can strip the
> plaintext out of an html document fairly reliably, I'd like to hear
> about it.
>
> Joe
>
> On Tue, Mar 10, 2009 at 11:36 PM, Joe <qbpro...@gmail.com> wrote:
> > I'm trying to just dump plaintext from websites.  I was originally
> > trying to use this function:http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#cl...
> > but it was crashing.
>
> > After some googling, I found this:
> >http://www.experts-exchange.com/Programming/Languages/Regular_Express...

Joe

Mar 11, 2009, 12:01:43 PM
to beauti...@googlegroups.com
After having 2 parsers fail (HTMLParser, SGMLParser), I think I'm
going to use lynx or w3m and dump the output. It seems like the
safest option.
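
For anyone taking the same route, a rough Python 3 sketch of shelling out to a text browser could look like this. It is not the code used in this thread (none was posted): build_dump_command and dump_page are hypothetical helpers, and it assumes lynx or w3m is installed and on the PATH.

```python
import subprocess


def build_dump_command(url, browser="lynx"):
    """Return the argument list that renders url as plain text."""
    if browser == "lynx":
        # -dump prints the rendered page; -nolist omits the link list.
        return ["lynx", "-dump", "-nolist", url]
    if browser == "w3m":
        return ["w3m", "-dump", url]
    raise ValueError("unsupported browser: %r" % (browser,))


def dump_page(url, browser="lynx", timeout=60):
    """Run the text browser and return its output as a string."""
    result = subprocess.run(
        build_dump_command(url, browser),
        capture_output=True,
        check=True,       # raise CalledProcessError on a non-zero exit
        timeout=timeout,  # do not hang forever on a slow site
    )
    # Replace undecodable bytes instead of failing on mislabelled pages.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list avoids shell quoting problems with URLs that contain characters such as & or ?.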

Joe

Z.

Mar 12, 2009, 8:04:53 AM
to beautifulsoup
html2text or stripogram can convert HTML to plain text; see
http://stackoverflow.com/questions/598817/python-html-removal

On Mar 11, 7:01 pm, Joe <qbpro...@gmail.com> wrote:
> After having 2 parsers fail (HTMLParser, SGMLParser), I think I'm
> going to use lynx or w3m and dump the output.  It seems like the
> safest option.
>
> Joe
>

Pratik Dam

Mar 12, 2009, 9:03:28 AM
to beauti...@googlegroups.com
 
 
__version__ = "3.0.4"

import urllib
from BeautifulSoup import BeautifulSoup

# u is the URL of the page to parse
soup = BeautifulSoup(urllib.urlopen(u).read())
soup.feed()

Zulq Alam

Mar 13, 2009, 6:53:03 AM
to beauti...@googlegroups.com
Try again with the same options he used:

soup = BeautifulSoup(urllib.urlopen(u).read(),
                     convertEntities=BeautifulSoup.HTML_ENTITIES)

Joe

Mar 13, 2009, 10:31:08 AM
to beauti...@googlegroups.com
I got it working pretty well with lynx. I don't have the source code
with me right now, but if you're interested I can send it to the list.

Joe