Extracting a subtree

yshek

Mar 26, 2009, 4:04:15 PM
to beautifulsoup
Hi,

Reading through the documentation I couldn't find a solution for my
problem...

Is there any way to strip down a subtree from the rest of the tree and
only keep that subtree (the one extracted)?
I saw that there are ways to remove a subtree from the rest of the
tree. What I'm looking for is kind of the opposite.

Thanks!

Jason Wang

Mar 26, 2009, 5:22:48 PM
to beautifulsoup
If you want, for example, a list of all tables in the document, you can
do tables = soup("table"). This will not remove anything else, but it
gives you each table's subtree, and you can do further processing on
it.
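
A minimal sketch of the call described above, assuming Beautiful Soup 4 (the bs4 package) on Python 3 rather than the 2009-era release discussed in this thread. The sample HTML is made up for illustration. Calling the soup object is shorthand for find_all(), and it leaves the document untouched:

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical sample document with two tables.
html = (
    "<p>intro</p>"
    "<table><tr><td>1</td></tr></table>"
    "<table><tr><td>2</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

tables = soup("table")         # shorthand for soup.find_all("table")
same = soup.find_all("table")  # the explicit spelling of the same search

print(len(tables))             # number of <table> elements found
print(soup.find("p"))          # the rest of the document is still there
```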

yshek

Mar 26, 2009, 6:00:09 PM
to beautifulsoup
Well,
what I'm trying to do is a bit more complex.
As part of an academic assignment, I'm writing a crawler that extracts
information from changing scopes.
After finding a certain pattern I'm looking for, I try to find a
certain ancestor tag and then need to perform searches only within the
scope of that ancestor's subtree.
I guess an alternative question is - is there a way to perform pattern
matching on a specific subtree?

Thanks again!

Jason Wang

Mar 26, 2009, 7:46:57 PM
to beautifulsoup
However, you do need a way of isolating that subtree. My method lets
you isolate that particular subtree: the whole tree will still be there
if you need it, but any operation you perform will apply only to that
subtree. For example, the following code:

tables = soup("table")
t1 = tables[0]  # the first table in the HTML
trs = t1("tr")  # all rows in that particular table, not the whole file

and so on. You can certainly use regular expressions in the same way,
passing a compiled pattern to the search instead of a tag name.
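
One possible sketch of the workflow discussed in this thread (match a pattern, climb to an ancestor, then search only inside that ancestor). It assumes Beautiful Soup 4 (bs4) on Python 3, and the HTML, the "price" pattern, and the variable names are all made up for illustration:

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical document with two similar tables.
html = """
<div id="a"><table><tr><td>price: 10</td><td>apple</td></tr></table></div>
<div id="b"><table><tr><td>price: 20</td><td>pear</td></tr></table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: find the text node that matches the pattern.
hit = soup.find(string=re.compile(r"price: 20"))

# Step 2: walk up to the nearest enclosing <table>.
table = hit.find_parent("table")

# Step 3: searches called on a Tag only look inside that Tag's subtree.
cells = [td.get_text() for td in table.find_all("td")]
names = table.find_all(string=re.compile(r"^p[a-z]+$"))

print(cells)  # ['price: 20', 'pear']
print(names)  # ['pear']
```

Nothing is removed from soup here, so the other table stays available for later searches.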

Zulq Alam

Mar 30, 2009, 11:38:08 PM
to beauti...@googlegroups.com
Won't extract() work for you? Once you extract an element there's nothing
to stop you using the extracted element independently.

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 import path

data = """<tree>
<subtree1>abc</subtree1>
<subtree2>def</subtree2>
<subtree3>xyz</subtree3>
</tree>"""

soup = BeautifulSoup(data)
subtree2 = soup.find('subtree2')

subtree2.extract()

>>> print soup
<tree>
<subtree1>abc</subtree1>

<subtree3>xyz</subtree3>
</tree>

>>> print subtree2
<subtree2>def</subtree2>
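
A rough equivalent for readers on current versions, assuming Beautiful Soup 4 (bs4) on Python 3. The nested leaf tag is a hypothetical addition, included only to show that the detached element can still be searched on its own:

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Same shape as the example above, plus a hypothetical <leaf> tag.
data = (
    "<tree>"
    "<subtree1>abc</subtree1>"
    "<subtree2><leaf>def</leaf></subtree2>"
    "<subtree3>xyz</subtree3>"
    "</tree>"
)
soup = BeautifulSoup(data, "html.parser")

# extract() detaches the element from the tree and returns it.
subtree2 = soup.find("subtree2").extract()

inside = subtree2.find("leaf").get_text()  # search within the detached subtree
left_behind = soup.find("leaf")            # None: no longer part of soup

print(inside)
print(left_behind)
```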

yshek

Mar 31, 2009, 8:35:06 AM
to beautifulsoup
Thank you both for your responses.
You made me realize that I had a small misconception about how this works.
It's working fine now.

Thanks again.