newlines in HTML content


Clemens Hermann

Dec 13, 2007, 4:06:34 AM
to beautifulsoup
Hi,

first, I want to thank you for this valuable tool. However, the first
time I used it I was probably caught out by a feature that, to me,
looks like a bug :).

Take this example from the Beautiful Soup docs:

> doc = ['<html><head><title>Page title</title></head>',
> '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
> '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
> '</html>']
>
> soup = BeautifulSoup(''.join(doc))

This works like a charm. However, the markup I normally get from a web
page also contains newlines. In the best case it looks like the output
of soup.prettify().
When I parse this prettified soup, I get into trouble:

> soup2 = BeautifulSoup(soup.prettify())
>
> print soup.contents[0].contents[0] # gives <head><title>Page title</title></head>
> print soup2.contents[0].contents[0] # gives just a newline

The reason is that BeautifulSoup() "inserts" a newline string between
all elements at the corresponding contents[] level (probably the
newline from the original markup):

> print soup2.contents[0].contents
> [u'\n', <head>
> <title>
> Page title
> </title>
> </head>, u'\n', <body>
--snip--

Surely I am not interested in the newlines that happen to occur in the
original HTML document? Furthermore, this behavior makes it pretty much
impossible to use things like previousSibling, because newline
characters unrelated to the HTML structure may have crept into
contents[].
So in my case the following preprocessor saved the day:

> import re
>
> def bs_preprocess(html):
>     """Remove distracting whitespace and newline characters."""
>     pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
>     html = re.sub(pat, '', html)       # remove leading and trailing whitespace on each line
>     html = re.sub(r'\n', ' ', html)    # convert newlines to spaces, so words that were
>                                        # separated only by a newline stay separated
>     html = re.sub(r'\s+<', '<', html)  # remove whitespace before any tag
>     html = re.sub(r'>\s+', '>', html)  # remove whitespace after any tag
>     return html
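
To make the effect of the preprocessor above concrete, here is a minimal Python 3 sketch that needs only the standard library. The `pretty` string is a hand-written stand-in for prettify() output (an assumption for illustration, not real Beautiful Soup output), and the function is restated so the block runs on its own. It also shows one side effect worth knowing about: whitespace next to inline tags such as <b> is removed as well, so words can end up glued to the tag.

```python
import re

def bs_preprocess(html):
    """Remove whitespace that exists only because of source formatting."""
    # Strip leading and trailing whitespace on every line.
    html = re.sub(r'(^\s+)|(\s+$)', '', html, flags=re.MULTILINE)
    # Convert the remaining newlines to single spaces.
    html = html.replace('\n', ' ')
    # Drop whitespace directly before '<' and directly after '>'.
    html = re.sub(r'\s+<', '<', html)
    html = re.sub(r'>\s+', '>', html)
    return html

# Hypothetical input, indented one tag per line like prettified markup.
pretty = (
    "<html>\n"
    " <head>\n"
    "  <title>\n"
    "   Page title\n"
    "  </title>\n"
    " </head>\n"
    "</html>\n"
)

print(bs_preprocess(pretty))
# <html><head><title>Page title</title></head></html>

# Side effect: the space before the inline <b> tag is removed too.
print(bs_preprocess('<p>This is paragraph <b>one</b>.</p>'))
# <p>This is paragraph<b>one</b>.</p>
```

So the preprocessor is fine when only the tag structure matters, but it changes the visible text around inline tags and would also alter the contents of whitespace-sensitive elements such as <pre>.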

Well, this ought to be a really common issue, so I suspect there is a
better solution that you might kindly suggest here.

thanks in advance for any hint,

/ch

Justin

Jan 16, 2008, 3:39:16 AM
to beautifulsoup
Ya, I have the same problem. Your preprocessor definitely helped.

Thanks,
Justin


On Dec 13 2007, 1:06 am, Clemens Hermann

Pascal Boulerie

Dec 10, 2019, 9:48:50 AM
to beautifulsoup
I would like to know the best method for cleaning text in Python 3 using a dedicated library (something state-of-the-art as of 2019).

I was unable to find one myself by searching for these 2 keywords:
duplicated whitespaces

(So, as I said in another message, a documentation wiki would really help...)
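
For the narrow case of duplicated whitespace, a dedicated library may not be needed. The following is a minimal Python 3 sketch using only the standard library, not a recommendation of any particular cleaning package; the function names `collapse_whitespace` and `collapse_whitespace_split` are hypothetical and chosen here for illustration.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

def collapse_whitespace_split(text):
    """Same idea without a regex: split on whitespace runs, then rejoin."""
    return ' '.join(text.split())

sample = "  duplicated   whitespaces\n\n here\t"
print(collapse_whitespace(sample))        # duplicated whitespaces here
print(collapse_whitespace_split(sample))  # duplicated whitespaces here
```

For ordinary spaces, tabs and newlines the two variants give the same result; the split/join form avoids compiling a pattern, while the regex form is easier to extend if only some whitespace characters should be collapsed.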