I'm new to Beautiful Soup (and to Python). I'm liking it a lot, but I
just ran into a problem:
Newlines in the HTML file you're parsing are treated, at least in some
contexts, as siblings of tags.
I have a workaround, but I'm wondering if there's a better approach.
Here's an example to show the problem. Take this HTML file and name
it tmp.html:
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<pre>
I want to retain

the blank line in this.
</pre>
</body>
</html>
Then run the following code:
from BeautifulSoup import BeautifulSoup, Tag, NavigableString, Comment
myFile = open('tmp.html', 'r')
myFile_doc = myFile.read()
myFile.close()
soup = BeautifulSoup(myFile_doc)
print 'Contents:'
print soup.body.contents
print
item = soup.p
while item:
    print 'Item:'
    print item
    print '------'
    print
    item = item.nextSibling
In the output, the contents list includes a bunch of u'\n' items that I
don't want. So when I iterate over siblings, every other sibling
ends up being a newline string rather than a tag.
There was some discussion of this issue in this group a couple years
ago:
http://groups.google.com/group/beautifulsoup/browse_thread/thread/177b1d80e6d76cee/
The person who posted that discussion suggested preprocessing the HTML
to remove whitespace and newlines before parsing. But that doesn't
work when there are newlines that you do want to retain, as in the pre
tag in my example above.
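To make that concrete, here is a minimal sketch of the problem. The regex is only my guess at what such preprocessing might look like (it is not the code from that thread), and sample_html is a made-up stand-in for the body of tmp.html:

```python
import re

# Made-up stand-in for the body of tmp.html, with a blank line inside <pre>.
sample_html = (
    "<p>Paragraph 1</p>\n"
    "<p>Paragraph 2</p>\n"
    "<pre>\n"
    "I want to retain\n"
    "\n"
    "the blank line in this.\n"
    "</pre>\n"
)

# Assumed preprocessing step: drop every newline plus surrounding whitespace.
stripped = re.sub(r'\s*\n\s*', '', sample_html)

# The newlines between tags are gone, but so is the formatting inside <pre>.
pre_text = stripped[stripped.index('<pre>') + len('<pre>'):stripped.index('</pre>')]
```

After this, pre_text is all on one line, with "retain" and "the" run together, which is exactly what I don't want.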
My workaround is just to test each new sibling to see if it's a
newline, and ignore it if it is:
while item:
    if item != '\n':
        print 'Item:'
        print item
        print '------'
        print
    item = item.nextSibling
Which is probably good enough for my purposes. But I figured it
couldn't hurt to ask if there's a better and more Soupiful way to deal
with the newlines.
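One variation I can imagine is wrapping the same test in a generator, so the loop body doesn't have to know about newlines at all. This is only a sketch: non_blank_siblings is a name I made up, it skips any whitespace-only string (not just a single '\n'), and FakeText, FakeTag and chain are hypothetical stand-ins so the snippet runs without Beautiful Soup installed. The helper itself relies only on .nextSibling:

```python
try:
    string_types = basestring   # Python 2, as used in this post
except NameError:
    string_types = str          # Python 3

def non_blank_siblings(item):
    """Yield item and its following siblings, skipping whitespace-only strings."""
    while item is not None:
        is_blank_text = isinstance(item, string_types) and item.strip() == ''
        if not is_blank_text:
            yield item
        item = item.nextSibling


# Hypothetical stand-ins for NavigableString and Tag, for illustration only.
class FakeText(str):
    nextSibling = None

class FakeTag(object):
    def __init__(self, name):
        self.name = name
        self.nextSibling = None

def chain(*nodes):
    """Link nodes left to right through .nextSibling and return the first."""
    for left, right in zip(nodes, nodes[1:]):
        left.nextSibling = right
    return nodes[0]

# Same shape as the body of tmp.html: tags separated by newline strings.
first = chain(FakeTag('p'), FakeText('\n'), FakeTag('p'),
              FakeText('\n'), FakeTag('pre'), FakeText('\n'))
names = [node.name for node in non_blank_siblings(first)]
```

With the real soup I would expect to call it as non_blank_siblings(soup.p). Text inside the pre tag is untouched, since only sibling strings that are entirely whitespace get skipped.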
Any ideas?
thanks,
--jed