Clemens Hermann
Dec 13, 2007, 4:06:34 AM
to beautifulsoup
Hi,
First, I want to thank you for this valuable tool. However, while using
it for the first time, I was probably caught out by a feature that,
perhaps mistakenly, looks to me like a bug :).
Take this example from the Beautiful Soup docs:
> doc = ['<html><head><title>Page title</title></head>',
> '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
> '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
> '</html>']
>
> soup = BeautifulSoup(''.join(doc))
This works like a charm. However, the markup I normally get is a web
page that also contains newlines. In the best case it resembles the
output of soup.prettify().
When parsing this prettified soup I get into trouble:
> soup2 = BeautifulSoup(soup.prettify())
>
> print soup.contents[0].contents[0] # gives <head><title>Page title</title></head>
> print soup2.contents[0].contents[0] # gives just a newline
The reason for this is that BeautifulSoup() appears to put a newline
string between all elements at the corresponding contents[] level
(probably the newline carried over from the initial markup).
> print soup2.contents[0].contents
> [u'\n', <head>
> <title>
> Page title
> </title>
> </head>, u'\n', <body>
--snip--
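The same effect can be reproduced without Beautiful Soup, which suggests the newlines come from the markup itself rather than from the library. Below is a minimal sketch, assuming Python 3 and only the standard library's html.parser module. The class name TextChunkCollector and the sample markup are made up for illustration and are not part of Beautiful Soup.

```python
from html.parser import HTMLParser


class TextChunkCollector(HTMLParser):
    """Hypothetical helper: records every piece of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for all character data, including whitespace-only runs.
        self.chunks.append(data)


# Markup with one newline after each tag, roughly like prettified output.
markup = '<html>\n<head>\n<title>Page title</title>\n</head>\n</html>'

collector = TextChunkCollector()
collector.feed(markup)
collector.close()

# Text chunks that consist of nothing but whitespace.
whitespace_only = [chunk for chunk in collector.chunks if not chunk.strip()]

print(collector.chunks)
print(len(whitespace_only), 'whitespace-only chunks')
```

The whitespace between tags is reported as ordinary text, so a tree builder that keeps all text would end up with those strings as siblings of the real elements.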
I am not at all interested in the newlines that occur in the original
HTML document. Furthermore, this behavior makes it pretty much
impossible to rely on things like previousSibling, since newline
characters unrelated to the HTML structure may have crept into contents[].
So in my case the following preprocessor saved the day:
> import re
>
> def bs_preprocess(html):
>     """Remove distracting whitespace and newline characters."""
>     pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
>     html = re.sub(pat, '', html)  # strip leading and trailing whitespace on each line
>     html = re.sub(r'\n', ' ', html)  # convert the remaining newlines to spaces,
>     # so that words on adjacent lines stay separated
>     html = re.sub(r'\s+<', '<', html)  # remove whitespace before any tag
>     html = re.sub(r'>\s+', '>', html)  # remove whitespace after any tag
>     return html
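One side effect of stripping all whitespace next to tags is that a space before an inline tag, as in "paragraph <b>one</b>", is removed too. A more conservative variant might drop only the whitespace-only runs that sit directly between two tags. This is a sketch under that assumption, written for Python 3. The function name strip_inter_tag_whitespace is hypothetical and nothing here is a Beautiful Soup feature.

```python
import re


def strip_inter_tag_whitespace(html):
    """Remove whitespace-only runs found directly between two tags.

    Text that contains anything besides whitespace is left untouched,
    so a space before an inline tag survives.
    """
    return re.sub(r'>\s+<', '><', html)


sample = '<body>\n <p>This is paragraph <b>one</b>.\n </p>\n</body>'
cleaned = strip_inter_tag_whitespace(sample)
print(cleaned)
```

This still discards a meaningful space between two adjacent inline elements (for example "</b> <i>") and it alters the content of <pre> blocks, so it is a workaround rather than a general solution.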
But this ought to be a really common issue, so I suspect there is a
better solution that you could kindly suggest here.
Thanks in advance for any hint,
/ch