Trouble manipulating the parse tree


Jim

Feb 9, 2010, 10:13:04 AM
to beautifulsoup
Hello,

I'm having trouble manipulating the parse tree and if someone could
give me a hint, I'd be very grateful.

I have Python 2.6 and BeautifulSoup (BS) 3.0.7a, running on an Ubuntu system.

I have XML files, a small part of which, the contents of the
<description> tag, is actually simple HTML. I'm trying to fix those
parts up a bit. I have these two points I'm working on:
(1) I want to convert HTML entities to the associated Unicode
character.
(2) The HTML was written so that paragraphs are separated by <p/> tags,
and I'd like to convert it to paragraphs surrounded by <p>..</p>.
There are also other issues with the files but I'm asking about an
interaction between these two.
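
To make the two goals concrete, here is a minimal sketch of the intended result on a plain string. It is only an illustration under stated assumptions: it uses the Python 3 standard library rather than the Python 2.6 and BS setup described here, the helper names html_entities_to_unicode and wrap_paragraphs and the sample string are hypothetical, and a regex split does nothing to rescue bad tag structure, which is the reason for using BS in the real program.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; these stay escaped so the output remains XML.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}
ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def html_entities_to_unicode(text):
    """Goal (1): replace named HTML entities with their Unicode characters."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return chr(name2codepoint[name])
    return ENTITY_RE.sub(substitute, text)


def wrap_paragraphs(fragment):
    """Goal (2): turn <p/>-separated text into <p>..</p> paragraphs."""
    pieces = re.split(r'<p\s*/>', fragment)
    return ''.join('<p>%s</p>' % piece.strip() for piece in pieces if piece.strip())


# Hypothetical sample input, not taken from the real XML files.
sample = 'Caf&eacute; &amp; bar<p/>Second paragraph'
converted = wrap_paragraphs(html_entities_to_unicode(sample))
print(converted)
```

The entity step runs first so that the paragraph step only ever sees text and <p/> separators.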

I am using BS, thinking that it would rescue some of the bad tag
structure (some of the other issues).

I have included a program below that illustrates the problem. I have
also put the program at ftp://alan.smcvt.edu/hefferon/try.py if that
is more convenient. The program reads an XML file in the same online
directory, which I'm using to illustrate the issue.

To reproduce the problem:

If I run the program below as is, it works great.

If I uncomment the "change all strings code" then it fails in an
infinite loop, with n being the same entry over and over (the
sixth item in the list c). My understanding is that I've only replaced
each NavigableString with another NavigableString. Why does that cause
it to fail?

To further confuse me, if I also uncomment the line # new_string='k'
then it is back to working great.

(I've stripped out the functionality from the change-all-strings code
to make this program smaller. In my actual program it replaces the
HTML entities.)

If someone has a suggestion about how to proceed, it would be a great
help to me.

Regards,
Jim

.................................... try.py ....................................

#!/usr/local/bin/python2.6
# -*- encoding: utf-8 -*-
import urllib
import BeautifulSoup


XHTML_SELF_CLOSING_TAGS=['area', 'base', 'br', 'col', 'command',
    'embed', 'eventsource', 'hr', 'img', 'input', 'link', 'meta', 'param',
    'source']
def convert_description_to_xml(d):
    soup=BeautifulSoup.BeautifulStoneSoup(d,selfClosingTags=XHTML_SELF_CLOSING_TAGS)
##     # Begin change all strings code
##     for ns in soup.findAll(text=True):
##         new_string=str(ns)
##         # new_string='k'
##         new_ns=BeautifulSoup.NavigableString(new_string)
##         if not(isinstance(ns,(BeautifulSoup.CData,
##                 BeautifulSoup.Comment, BeautifulSoup.Declaration,
##                 BeautifulSoup.ProcessingInstruction))):
##             ns.replaceWith(new_ns)
##     # End change all strings code
    # Rescue the paragraph structure
    no_p_contents=[]  # Will get entities before the first <p ..> tag
    new_description_tag=BeautifulSoup.Tag(soup,'description')
    initial_p_tag=BeautifulSoup.Tag(soup,'p')
    new_description_tag.insert(0,initial_p_tag)
    initial=True
    c=soup.description.contents
    while c:
        # print "the length of c is now",str(len(c))
        n=c[0]
        n.extract()
        # print "## type(n)=",repr(type(n)),"n=",str(n) # ,"## ",str(n_copy)
        if (initial
            and isinstance(n,BeautifulSoup.Tag)
            and n.name=='p'):
            initial=False
        if initial:
            # print "** about to insert in the p: initial_p_tag is",repr(initial_p_tag.contents)
            # print "** len(initial_p_tag.contents)",str(len(initial_p_tag.contents))
            initial_p_tag.insert(len(initial_p_tag.contents),n)
        else:
            new_description_tag.insert(len(new_description_tag.contents),n)
    soup.description.replaceWith(new_description_tag)
    print "\n\n",str(soup.description)

if __name__=='__main__':
    f=urllib.urlopen('ftp://alan.smcvt.edu/hefferon/ob_desc.xml')
    d=f.read()
    convert_description_to_xml(d)
