XML and regex

Michael Bláha

Sep 7, 2017, 10:48:19 AM
to beautifulsoup
Hello,
I've run into some strange behavior and cannot figure out how to fix it.

If I use the 'lxml' or 'html.parser' parser, the following code:

import re
from bs4 import BeautifulSoup

tags = ["tc:elem1", "tc:.*"]
soup = BeautifulSoup("""<tc:root xmlns:tc="http://myfaces.apache.org/tobago/component">
<tc:elem1 label="{label.test$string}" />
<tc:elem1 label="{blabla.test$string}" />
</tc:root>""", "html.parser")

# Match each entry as a plain string
print("without regex")
for tag in tags:
    for el in soup.findAll(name=tag):
        print(el.name)

# Match each entry as a compiled regular expression
print("with regex")
for tag in tags:
    for el in soup.findAll(name=re.compile(tag)):
        print(el.name)

prints out:
without regex
with regex
tc:elem1
tc:elem1
tc:root
tc:elem1
tc:elem1

If I use the 'xml' parser like this:
import re
from bs4 import BeautifulSoup

tags = ["tc:elem1", "tc:.*"]
soup = BeautifulSoup("""<tc:root xmlns:tc="http://myfaces.apache.org/tobago/component">
<tc:elem1 label="{label.test$string}" />
<tc:elem1 label="{blabla.test$string}" />
</tc:root>""", "xml")

# Match each entry as a plain string
print("without regex")
for tag in tags:
    for el in soup.findAll(name=tag):
        print(el.name)

# Match each entry as a compiled regular expression
print("with regex")
for tag in tags:
    for el in soup.findAll(name=re.compile(tag)):
        print(el.name)

prints out:
without regex
elem1
elem1
with regex


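I really want to use the 'xml' parser because of its other advantages, but it does not seem to work with regex at all: the compiled patterns match nothing, and the plain-string match returns the elements with the 'tc:' prefix stripped from their names.
Also, the 'lxml' and 'html.parser' parsers behave strangely without the regex: the plain string "tc:elem1" matches nothing, even though the same string compiled as a regex matches both elements.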

Please help.




