Dealing with newlines as siblings


Jed

Aug 9, 2010, 9:15:49 PM
to beautifulsoup
I'm new to Beautiful Soup (and to Python). I'm liking it a lot, but I
just ran into a problem:

Newlines in the HTML file you're parsing are treated, at least in some
contexts, as siblings of tags.

I have a workaround, but I'm wondering if there's a better approach.

Here's an example to show the problem. Take this HTML file and name
it tmp.html:


<html>
<head></head>
<body>
<p>Paragraph 1</p>

<p>Paragraph 2</p>
<pre>
I want to retain

the blank line in this.
</pre>
</body>
</html>


Then run the following code:


from BeautifulSoup import BeautifulSoup, Tag, NavigableString, Comment

myFile = open('tmp.html', 'r')
myFile_doc = myFile.read()
myFile.close()

soup = BeautifulSoup(myFile_doc)

print 'Contents:'
print soup.body.contents
print

item = soup.p
while item:
    print 'Item:'
    print item
    print '------'
    print
    item = item.nextSibling


In the output, the contents include a bunch of u'\n' items that I
don't want. So if I'm iterating over siblings, a bunch of the siblings
end up being newlines.
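
A minimal sketch of where those newline items come from, assuming the later bs4 package (Beautiful Soup 4) on Python 3 rather than the BeautifulSoup 3 import used above. The attribute next_sibling is the bs4 spelling of nextSibling, and the names tag_names and blank_strings are made up for illustration. The whitespace between tags is parsed into NavigableString nodes that sit next to the tags in body.contents:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A cut-down version of tmp.html (assumption: the stdlib "html.parser" builder).
html = (
    "<body>\n"
    "<p>Paragraph 1</p>\n"
    "\n"
    "<p>Paragraph 2</p>\n"
    "<pre>\nI want to retain\n\nthe blank line in this.\n</pre>\n"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Separate the real tags from the whitespace-only text nodes between them.
tag_names = [c.name for c in soup.body.contents if isinstance(c, Tag)]
blank_strings = [
    c for c in soup.body.contents
    if isinstance(c, NavigableString) and not c.strip()
]

print(tag_names)
print(len(blank_strings), "whitespace-only siblings")
# The node right after the first <p> is a text node, not the second <p>.
print(type(soup.p.next_sibling).__name__)
```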

There was some discussion of this issue in this group a couple years
ago:

http://groups.google.com/group/beautifulsoup/browse_thread/thread/177b1d80e6d76cee/

The person who posted that discussion suggested preprocessing the HTML
to remove whitespace and newlines before parsing. But that doesn't
work when there are newlines that you do want to retain, as in the pre
tag in my example above.

My workaround is just to test each new sibling to see if it's a
newline, and ignore it if it is:

while item:
    if item != '\n':
        print 'Item:'
        print item
        print '------'
        print
    item = item.nextSibling

Which is probably good enough for my purposes. But I figured it
couldn't hurt to ask if there's a better and more Soupiful way to deal
with the newlines.
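
One way the same check could be packaged, sketched here under the assumption of bs4 on Python 3 (not the BeautifulSoup 3 module imported above). The helper name significant_siblings is hypothetical. It treats any whitespace-only string as ignorable, which is slightly broader than comparing against a single '\n', and it never looks inside a tag, so the text in a pre tag is untouched:

```python
from bs4 import BeautifulSoup, NavigableString

def significant_siblings(node):
    """Yield node and its following siblings, skipping whitespace-only strings.

    Hypothetical helper: nothing is removed from the tree, the blank
    text nodes are simply not yielded.
    """
    while node is not None:
        is_blank_text = isinstance(node, NavigableString) and not node.strip()
        if not is_blank_text:
            yield node
        node = node.next_sibling  # bs4 name for BeautifulSoup 3's nextSibling

html = (
    "<body>\n<p>Paragraph 1</p>\n\n<p>Paragraph 2</p>\n"
    "<pre>\nI want to retain\n\nthe blank line in this.\n</pre>\n</body>"
)
soup = BeautifulSoup(html, "html.parser")

items = list(significant_siblings(soup.p))
for item in items:
    print("Item:")
    print(item)
    print("------")
```

Because the generator only filters what it yields, the parsed document itself is left unchanged.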

Any ideas?

thanks,

--jed

Reck

Aug 10, 2010, 9:30:31 AM
to beautifulsoup
When I drill down to the pre tag, I get it all in one NavigableString
object:

>>> soup.contents[0].contents[3].contents[5].contents[0]
u' \nI want to retain \n\nthe blank line in this. \n'

That is the string you want. I have never used the nextSibling
method to get at the information I need.

Indexing through contents like that is impractical, but you can get
the same thing with findAll:

>>> item = soup.findAll('pre')
>>> item[0]
<pre>
I want to retain

the blank line in this.
</pre>
>>>

What version of Beautiful Soup are you using?

Let me know if that helps.

-Reck

Jed

Aug 13, 2010, 3:07:34 PM
to beautifulsoup
Thanks for the note!

Unfortunately, findAll isn't usable for my purposes; I need to iterate
through siblings to process all the lines in the file, in order. (For
example, I need to associate all the elements that come between two
headings with the preceding heading and with each other; given that
the elements after a heading aren't structurally nested inside that
heading, I don't think there's a way to do that with Beautiful Soup
other than to iterate over items using nextSibling.)
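
A rough sketch of that heading association, assuming bs4 on Python 3 rather than the BeautifulSoup 3 used in this thread. The function name group_by_heading and the HEADINGS set are hypothetical, and the sample markup is invented. The sketch keeps only Tag children, so it also drops any bare text sitting directly in the container, and it ignores elements that appear before the first heading:

```python
from bs4 import BeautifulSoup, Tag

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def group_by_heading(container):
    """Return a list of (heading_tag, [following_tags]) pairs in document order."""
    sections = []
    current = None
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skips the newline strings (and any other bare text)
        if child.name in HEADINGS:
            current = (child, [])
            sections.append(current)
        elif current is not None:
            current[1].append(child)
    return sections

html = (
    "<body>\n<h2>A</h2>\n<p>a1</p>\n<pre>x\n\ny</pre>\n"
    "<h2>B</h2>\n<p>b1</p>\n</body>"
)
soup = BeautifulSoup(html, "html.parser")
sections = group_by_heading(soup.body)

for heading, items in sections:
    print(heading.get_text(), [tag.name for tag in items])
```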

Just to be clear, the reason I included the newlines in my pre tag
wasn't to say that the newlines get taken away by Beautiful Soup; it
was to point out that the workaround proposed a while back, of
removing all the newlines from the original file before processing
with Soup, doesn't work if you need to preserve newlines inside tags.

I'm not really clear on why newlines outside of tags are treated as
siblings in the first place; they're not part of the HTML structure.
But given that they are treated as siblings, I'd like to find a good
way to ignore newlines outside of tags while not discarding newlines
inside of pre tags.
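
Another option is to remove the blank text nodes from the parsed tree instead of from the source file. This is a sketch only, assuming bs4 on Python 3 (the string= argument to find_all, find_parent and extract are bs4 calls, not the BeautifulSoup 3 ones used earlier), and drop_blank_strings is a made-up name. Strings that have a pre ancestor are left alone, so the blank line inside the pre tag survives:

```python
from bs4 import BeautifulSoup

def drop_blank_strings(soup):
    """Remove whitespace-only text nodes, except those inside a <pre>."""
    # find_all returns a list up front, so extracting while looping is safe.
    for text_node in soup.find_all(string=True):
        if not text_node.strip() and text_node.find_parent("pre") is None:
            text_node.extract()

html = (
    "<body>\n<p>Paragraph 1</p>\n\n<p>Paragraph 2</p>\n"
    "<pre>\nI want to retain\n\nthe blank line in this.\n</pre>\n</body>"
)
soup = BeautifulSoup(html, "html.parser")
drop_blank_strings(soup)

# After the cleanup, sibling traversal moves directly from tag to tag.
print([child.name for child in soup.body.contents])
print(repr(soup.pre.get_text()))
```

Unlike filtering during iteration, this changes the tree, so it only fits when the whitespace between tags is not needed later.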

For now, I'll just stick with the "check whether it's a newline and
discard it if so" approach. But if anyone has any other ideas, let me
know.

thanks,

--jed