First, I want to thank you for this valuable tool. However, while using
it for the first time I was trapped by a feature that at first glance
looked to me like a bug :).
Take this example from the Beautiful Soup docs:
> doc = ['<html><head><title>Page title</title></head>',
> '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
> '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
> '</html>']
> soup = BeautifulSoup(''.join(doc))
This works like a charm. However, the markup I normally get is a web
page that also contains newlines; in the best case it looks like the
output of soup.prettify(). When parsing such prettified soup I get into
trouble:
> soup2 = BeautifulSoup(soup.prettify())
> print soup.contents[0].contents  # starts with <head><title>Page title</title></head>
> print soup2.contents[0].contents # starts with just a newline
The reason for this is that BeautifulSoup() keeps the newlines from the
original markup as text nodes, so one appears between all elements at
the corresponding contents level:
> print soup2.contents[0].contents
> [u'\n', <head>
> Page title
> </head>, u'\n', <body>
I am by no means interested in the newlines that occur in the original
HTML document, right? Furthermore, this behavior makes it pretty much
impossible to use things like previousSibling, as HTML-irrelevant
newline characters may have crept into contents.
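In the meantime, a small helper at least hides those text nodes when walking contents. This is only a sketch with a name of my own invention (real_children), and plain strings stand in for NavigableString objects, since the check only relies on whitespace stripping:

```python
def real_children(contents):
    """Return contents without whitespace-only string nodes (the stray u'\\n' entries)."""
    return [child for child in contents
            if not (isinstance(child, str) and not child.strip())]

# plain strings stand in for NavigableString / Tag nodes here
children = ['\n', '<head>...</head>', '\n', '<body>...</body>']
print(real_children(children))  # ['<head>...</head>', '<body>...</body>']
```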
So in my case the following preprocessor saved the day:
> import re
>
> def bs_preprocess(html):
>     """remove distracting whitespace and newline characters"""
>     pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
>     html = re.sub(pat, '', html)       # strip leading/trailing whitespace on every line
>     html = re.sub('\n', ' ', html)     # convert the remaining newlines to spaces rather
>                                        # than deleting them, so they still delimit words
>     html = re.sub(r'\s+<', '<', html)  # remove whitespace before tags
>     html = re.sub(r'>\s+', '>', html)  # remove whitespace after tags
>     return html
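For what it is worth, here is a quick self-contained check of what that preprocessor does to a prettify()-style snippet (the sample string is made up by hand; the logic is the same as the function above):

```python
import re

def bs_preprocess(html):
    """remove distracting whitespace and newline characters"""
    html = re.sub(r'(^\s+)|(\s+$)', '', html, flags=re.MULTILINE)
    html = re.sub('\n', ' ', html)     # newlines become spaces, so they still delimit words
    html = re.sub(r'\s+<', '<', html)  # remove whitespace before tags
    html = re.sub(r'>\s+', '>', html)  # remove whitespace after tags
    return html

prettified = '<html>\n <head>\n  <title>\n   Page title\n  </title>\n </head>\n</html>'
print(bs_preprocess(prettified))
# <html><head><title>Page title</title></head></html>
```

Note that whitespace inside text ("Page title") survives; only the indentation and the inter-tag newlines are removed.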
Well, this ought to be a really common issue, so I can hardly believe
there is no better solution; perhaps you could kindly suggest one here.
Thanks in advance for any hint,