Recursively extracting nested XML nodes and converting to HTML


Scott Newman

Dec 6, 2010, 3:21:16 PM

to beautifulsoup

Hello all,

I'm trying to extract some text from an XML document and replace some
of the nested tags with HTML tags. In the doc below, I just want to
replace the <inlineTag name="body"> tags with a <p></p> tagset, then
replace <inlineTag name="bold"> with <strong> and <inlineTag
name="italic"> with <em>. I've put a comment on what I'm trying to
achieve in the code.

I'm struggling with replacing <inlineTag name="body">. I've tried this
code but it stops at the first nested <inlineTag>:

body_tags = soup.findAll('inlinetag', {'name': 'body'}, recursive=True)

# Eventually iterate all the tags
b = body_tags[0]

# Not sure how to recursively get a reference to everything between the tags.
# This only outputs the text up to the first nested tag:
print b.renderContents()

The first thing I'm trying to accomplish is to get the ENTIRE contents
of each <inlineTag name="body"> tag. After that, I can then do my work
on the <italic> and <bold> stuff with more findAll commands.

I'm not great at conceptualizing recursion, so any guidance that can
be given would be really appreciated!
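
One way to get everything between a tag's start and end tags, nested markup included, is to serialize each child in turn. The sketch below is only an illustration: it uses Python 3's standard-library xml.etree.ElementTree rather than BeautifulSoup (a swapped-in parser, not the code from this post), and inner_xml and the sample string are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return everything between elem's start and end tags as a string."""
    # Text that appears before the first child element.
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child, all of its descendants, and the
        # text that follows the child's end tag (its "tail").
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical sample, shaped like one body tag from the document.
sample = ET.fromstring(
    '<inlineTag name="body">plain <inlineTag name="italic">slanted'
    '</inlineTag> and <inlineTag name="bold">heavy</inlineTag> text</inlineTag>'
)
full_contents = inner_xml(sample)
print(full_contents)
```

Because each child is serialized together with its descendants, the nesting depth does not matter here.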

Here's what I'm working with:

raw_xml = """
<?xml version="1.0" encoding="UTF-8"?>
<export>
<publication>
<issue>
<article>
<textObjects>
<textObject>
<text>
<inlineTag name="Story">
<inlineTag name="subhead">This is the
subhead.</inlineTag>
</inlineTag>
</text>
</textObject>
<textObject>
<text>
<inlineTag name="Story">
<inlineTag name="body">
The body of our content is usually
wrapped up nicely in simple tags, but
we do have some nested tags that
render the text in <inlineTag name="italic">
italics</inlineTag> or <inlineTag
name="bold">bold.</inlineTag> It's not the
easiest stuff to work with because
you cannot be sure how deeply it is nested.
<break type="paragraph"/>
</inlineTag>
<inlineTag name="body">
Sometimes we have notes in the
text from editors and we need to remove <note>the
tag and</note> its contents. It's
really a pain. Also, text doesn't always have
an ending break tag.
</inlineTag>
<inlineTag name="body">
Sometimes we get <inlineTag
name="italics">italicized text that has <inlineTag name="bold">
bold text</inlineTag></inlineTag>
inside of it. What a bear!
<break type="paragraph"/>
</inlineTag>
</inlineTag>
</text>
</textObject>
</textObjects>
</article>
</issue>
</publication>
</export>
"""

# Here's what I'm trying to end up with:
#
# <p>
# The body of our content is usually wrapped up nicely in simple tags, but
# we do have some nested tags that render the text in <em>italics</em> or
# <strong>bold.</strong> It's not the easiest stuff to work with
# because you cannot be sure how deeply it is nested.
# </p>
#
# <p>
# Sometimes we have notes in the text from editors and we need to remove its
# contents. It's really a pain. Also, text doesn't always have an
# ending break tag.
# </p>
#
# <p>
# Sometimes we get <em>italicized text that has <strong>bold text</strong></em>
# inside of it. What a bear!
# </p>
#

# There might be other pre-processing "fixes" to do here later
input_xml = raw_xml.strip()

from BeautifulSoup import BeautifulStoneSoup, Tag

# Parse it into a soup object so we can extract values. Make sure to
# identify the self closing tags or it will end up all crazy.
soup = BeautifulStoneSoup(input_xml, selfClosingTags=['break'])

# Remove <break type="paragraph" /> tags
for t in soup.findAll('break', {'type': 'paragraph'}):
    t.extract()

# Replace <inlinetag name="bold"> with "<strong>"
for t in soup.findAll('inlinetag', {'name': 'bold'}):
    strong_tag = Tag(soup, 'strong')
    strong_tag.insert(0, t.contents[0])
    t.replaceWith(strong_tag)

# Replace <inlinetag name="link"> with "<a>"
for t in soup.findAll('inlinetag', {'name': 'link'}):
    a_tag = Tag(soup, 'a')
    a_tag.insert(0, t.contents[0])
    t.replaceWith(a_tag)

# Replace <inlinetag name="italic"> with "<em>"
for t in soup.findAll('inlinetag', {'name': 'italic'}):
    em_tag = Tag(soup, 'em')
    em_tag.insert(0, t.contents[0])
    t.replaceWith(em_tag)

# Remove <note> tags
for t in soup.findAll('note'):
    t.extract()
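
For comparison, here is a minimal sketch of the recursive conversion described above. It is not the BeautifulSoup code from this post: it assumes Python 3 and swaps in the standard-library xml.etree.ElementTree parser, which keeps same-named nested tags nested. The names HTML_TAGS, DROPPED, to_html and body_paragraphs are hypothetical, and mapping both "italic" and "italics" to <em> is an assumption based on the sample document.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from the "name" attribute to an HTML tag.
HTML_TAGS = {"bold": "strong", "italic": "em", "italics": "em"}
# Elements that are removed together with their contents.
DROPPED = {"note", "break"}


def to_html(elem):
    """Convert the contents of elem to HTML, recursing into nested tags."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        if child.tag not in DROPPED:
            inner = to_html(child)  # the recursion handles any nesting depth
            html_tag = HTML_TAGS.get(child.get("name"))
            if html_tag:
                parts.append("<%s>%s</%s>" % (html_tag, inner, html_tag))
            else:
                parts.append(inner)  # unknown wrapper: keep only its contents
        # Text after the child's end tag belongs to the parent element.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def body_paragraphs(xml_text):
    """Return one <p>...</p> string per <inlineTag name="body"> element."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter("inlineTag"):
        if elem.get("name") == "body":
            # Collapse the line breaks and indentation left from the XML.
            result.append("<p>%s</p>" % " ".join(to_html(elem).split()))
    return result


# Hypothetical sample covering nesting, a note, and a break tag.
sample = (
    '<text><inlineTag name="Story"><inlineTag name="body">'
    'Some <inlineTag name="italic">slanted text with <inlineTag name="bold">'
    'heavy</inlineTag> words</inlineTag> here.<note>drop me</note> Done.'
    '<break type="paragraph"/></inlineTag></inlineTag></text>'
)
paragraphs = body_paragraphs(sample)
print(paragraphs)
```

The key step is that each child's tail text is appended after the child itself, so text following a nested tag is not lost. On the full document, the stripped string (input_xml above) would be passed to body_paragraphs instead of the sample.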
