Scott Newman
Dec 6, 2010, 3:21:16 PM
to beautifulsoup
Hello all,
I'm trying to extract some text from an XML document and replace some
of the nested tags with HTML tags. In the doc below, I just want to
replace the <inlineTag name="body"> tags with a <p> tagset, then
replace <inlineTag name="bold"> with <strong> and <inlineTag
name="italic"> with <em>. I've put a comment on what I'm trying to
achieve in the code.
I'm struggling with replacing <inlineTag name="body">. I've tried this
code but it stops at the first nested <inlineTag>:
    body_tags = soup.findAll('inlinetag', {'name': 'body'}, recursive=True)
    # Eventually iterate over all the tags
    b = body_tags[0]
    # Not sure how to recursively get a reference to everything between the tags.
    # This only outputs the text up to the first nested tag:
    print b.renderContents()
The first thing I'm trying to accomplish is to get the ENTIRE contents
of each <inlineTag name="body"> tag. After that, I can do my work
on the italic and bold tags with more findAll calls.
I'm not great at conceptualizing recursion, so any guidance that can
be given would be really appreciated!
Here's what I'm working with:
raw_xml = """
<?xml version="1.0" encoding="UTF-8"?>
<export>
  <publication>
    <issue>
      <article>
        <textObjects>
          <textObject>
            <text>
              <inlineTag name="Story">
                <inlineTag name="subhead">This is the subhead.</inlineTag>
              </inlineTag>
            </text>
          </textObject>
          <textObject>
            <text>
              <inlineTag name="Story">
                <inlineTag name="body">
                  The body of our content is usually
                  wrapped up nicely in simple tags, but
                  we do have some nested tags that
                  render the text in <inlineTag name="italic">italics</inlineTag>
                  or <inlineTag name="bold">bold.</inlineTag> It's not the
                  easiest stuff to work with because
                  you cannot be sure how deeply it is nested.
                  <break type="paragraph"/>
                </inlineTag>
                <inlineTag name="body">
                  Sometimes we have notes in the
                  text from editors and we need to remove <note>the
                  tag and</note> its contents. It's
                  really a pain. Also, text doesn't always have
                  an ending break tag.
                </inlineTag>
                <inlineTag name="body">
                  Sometimes we get <inlineTag name="italics">italicized text
                  that has <inlineTag name="bold"> bold text</inlineTag></inlineTag>
                  inside of it. What a bear!
                  <break type="paragraph"/>
                </inlineTag>
              </inlineTag>
            </text>
          </textObject>
        </textObjects>
      </article>
    </issue>
  </publication>
</export>
"""
# Here's what I'm trying to end up with:
#
# <p>
# The body of our content is usually wrapped up nicely in simple tags, but
# we do have some nested tags that render the text in <em>italics</em> or
# <strong>bold.</strong> It's not the easiest stuff to work with
# because you cannot be sure how deeply it is nested.
# </p>
#
# <p>
# Sometimes we have notes in the text from editors and we need to remove its
# contents. It's really a pain. Also, text doesn't always have an
# ending break tag.
# </p>
#
# <p>
# Sometimes we get <em>italicized text that has <strong> bold text</strong></em>
# inside of it. What a bear!
# </p>
#
# BeautifulSoup 3 imports used below
from BeautifulSoup import BeautifulStoneSoup, Tag

# There might be other pre-processing "fixes" to do here later
input_xml = raw_xml.strip()

# Parse it into a soup object so we can extract values. Make sure to
# identify the self-closing tags or it will end up all crazy.
soup = BeautifulStoneSoup(input_xml, selfClosingTags=['break'])

# Remove <break type="paragraph" /> tags
for t in soup.findAll('break', {'type': 'paragraph'}):
    t.extract()

# Replace <inlinetag name="bold"> with "<strong>"
for t in soup.findAll('inlinetag', {'name': 'bold'}):
    strong_tag = Tag(soup, 'strong')
    strong_tag.insert(0, t.contents[0])
    t.replaceWith(strong_tag)

# Replace <inlinetag name="link"> with "<a>"
for t in soup.findAll('inlinetag', {'name': 'link'}):
    a_tag = Tag(soup, 'a')
    a_tag.insert(0, t.contents[0])
    t.replaceWith(a_tag)

# Replace <inlinetag name="italic"> with "<em>"
for t in soup.findAll('inlinetag', {'name': 'italic'}):
    em_tag = Tag(soup, 'em')
    em_tag.insert(0, t.contents[0])
    t.replaceWith(em_tag)

# Remove <note> tags
for t in soup.findAll('note'):
    t.extract()
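
To make the recursion I'm after more concrete, here is a minimal sketch of the idea. It is not BeautifulSoup: it assumes Python 3 and the standard library's xml.etree.ElementTree, and the names HTML_NAMES, DROPPED and convert are made up for illustration. The point is only that each element renders its own text, then each child (by calling the same function), then the text that follows that child, so nesting depth stops mattering.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from the inlineTag "name" attribute to an HTML tag.
HTML_NAMES = {"body": "p", "bold": "strong", "italic": "em", "italics": "em"}
# Elements dropped together with everything inside them.
DROPPED = {"note", "break"}


def convert(elem):
    """Render elem as an HTML string, recursing into every nested child."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        if child.tag not in DROPPED:
            parts.append(convert(child))
        # Text after a child's closing tag belongs to the parent, so it is
        # kept even when the child itself is dropped.
        parts.append(escape(child.tail or "", quote=False))
    inner = "".join(parts)
    html_name = HTML_NAMES.get(elem.get("name"))
    if html_name is None:
        return inner  # unknown wrapper: keep its contents, drop the tag
    return "<{0}>{1}</{0}>".format(html_name, inner)


sample = (
    '<inlineTag name="body">Some <inlineTag name="italic">italic with '
    '<inlineTag name="bold">bold</inlineTag></inlineTag> text'
    '<note>editor note</note> here.<break type="paragraph"/></inlineTag>'
)
print(convert(ET.fromstring(sample)))
# prints: <p>Some <em>italic with <strong>bold</strong></em> text here.</p>
```

Note that ElementTree is case-sensitive, so the tag is "inlineTag" here rather than the lowercased "inlinetag" used with findAll above. Whitespace inside the text is passed through untouched in this sketch.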