DOM Parser


John DeBovis

Jun 9, 2011, 2:54:30 PM
to Pattern
I'm trying to parse an XML page and I noticed that I don't get the
desired behavior when I encounter empty (self-closing) tags that are
not containers, like <SecondaryStreet/>.

Consider this XML file:

<Address>
<PrimaryStreet>123 WOOHOOO DR</PrimaryStreet>
<SecondaryStreet/>
<City>CHESTERFIELD</City>
<County>94</County>
<State>MO</State>
<PostalCode>63017</PostalCode>
<CountryCode>USA</CountryCode>
</Address>

My output indicates that the pattern module thinks that city, county,
state, etc. are all nested under secondarystreet, which is not correct:

[document]:address:primarystreet ---> 123 WOOHOOO DR
[document]:address:secondarystreet:city ---> CHESTERFIELD
[document]:address:secondarystreet:county ---> 94
[document]:address:secondarystreet:state ---> MO
[document]:address:secondarystreet:postalcode ---> 63017
[document]:address:secondarystreet:countrycode ---> USA
[document]:phonenumbers:phonenumber --->

Are you aware of this? And is there a workaround or am I doing
something wrong?

Tom De Smedt

Jun 9, 2011, 4:12:14 PM
to Pattern
The DOM parser is built on BeautifulSoup, which is designed for HTML.
It may not always handle XML correctly (in particular: it doesn't know
about self-closing tags unless you define them). In the latest
revision I've added an optional parameter self_closing=[] to the
Document() class which may work for you:

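from pattern.web import Document
# Declare secondarystreet as self-closing so its siblings are not nested under it.
xml = "<address> ..."
dom = Document(xml, self_closing=["secondarystreet"])
print(dom.by_tag("address")[0].children)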

You can get the latest revision from github now or wait for the next
official release.
Also, you can always use (for example)
dom.by_tag("address")[0].by_tag("countrycode") to get all the countrycode
tags inside the first address tag, regardless of how they are nested.
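
If the input is well-formed XML rather than HTML, another possible workaround is to use a strict XML parser, which treats <SecondaryStreet/> as an empty element by definition. The sketch below is not part of Pattern: it uses Python's standard xml.etree.ElementTree on the address snippet from the question, and the helper name leaf_paths is made up for illustration. It prints paths in roughly the same format as the output above.

```python
import xml.etree.ElementTree as ET

# The address snippet from the original question.
XML = """<Address>
<PrimaryStreet>123 WOOHOOO DR</PrimaryStreet>
<SecondaryStreet/>
<City>CHESTERFIELD</City>
<County>94</County>
<State>MO</State>
<PostalCode>63017</PostalCode>
<CountryCode>USA</CountryCode>
</Address>"""


def leaf_paths(element, prefix="[document]"):
    """Yield (path, text) for every element that has no child elements.

    Hypothetical helper: it mimics the colon-separated, lowercased
    paths shown in the question; it is not a Pattern function.
    """
    path = prefix + ":" + element.tag.lower()
    children = list(element)
    if not children:
        # An empty element such as <SecondaryStreet/> has text None.
        yield path, (element.text or "").strip()
    for child in children:
        for item in leaf_paths(child, path):
            yield item


root = ET.fromstring(XML)
paths = list(leaf_paths(root))
for path, text in paths:
    print(path + " ---> " + text)
```

With this approach secondarystreet shows up as an empty sibling of city, county, and the rest, instead of as their parent. The trade-off is that a strict parser rejects malformed markup, which is the case BeautifulSoup is meant to tolerate.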

robdmi...@gmail.com

Jun 9, 2011, 4:20:33 PM
to pattern-f...@googlegroups.com
Sweet fix. Thanks for that.