XML and regex

Michael Bláha

Sep 7, 2017, 10:48:19 AM

to beautifulsoup

Hello,
I've run into some strange behavior and cannot figure out how to fix it.

If I use the 'lxml' or 'html.parser' parser, the following code:

import re
from bs4 import BeautifulSoup

tags = ["tc:elem1", "tc:.*"]
soup = BeautifulSoup("""<tc:root xmlns:tc="http://myfaces.apache.org/tobago/component">
                    <tc:elem1 label="{label.test$string}" />
                    <tc:elem1 label="{blabla.test$string}" />
                   </tc:root>""", "html.parser")

# Pass each entry as a plain string
print("without regex")
for tag in tags:
    for el in soup.findAll(name=tag):
        print(el.name)

# Pass each entry as a compiled regular expression
print("with regex")
for tag in tags:
    for el in soup.findAll(name=re.compile(tag)):
        print(el.name)

prints out:
without regex
with regex
tc:elem1
tc:elem1
tc:root
tc:elem1
tc:elem1

If I use the 'xml' parser like this:

import re
from bs4 import BeautifulSoup

tags = ["tc:elem1", "tc:.*"]
soup = BeautifulSoup("""<tc:root xmlns:tc="http://myfaces.apache.org/tobago/component">
                    <tc:elem1 label="{label.test$string}" />
                    <tc:elem1 label="{blabla.test$string}" />
                   </tc:root>""", "xml")

# Pass each entry as a plain string
print("without regex")
for tag in tags:
    for el in soup.findAll(name=tag):
        print(el.name)

# Pass each entry as a compiled regular expression
print("with regex")
for tag in tags:
    for el in soup.findAll(name=re.compile(tag)):
        print(el.name)

prints out:
without regex
elem1
elem1
with regex

I really want to use the 'xml' parser because of its other advantages, but it does not seem to work with regex at all.
Also, the 'lxml' and 'html.parser' parsers behave really strangely without the regex: the plain string "tc:elem1" matches nothing.
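
One possible reading of the 'xml' output, offered as a guess from the printed names rather than a confirmed explanation: with the 'xml' parser, el.name seems to hold only the local name ("elem1"), so a pattern that requires the "tc:" prefix has nothing to match. The sketch below uses only the standard re module, not Beautiful Soup. The two name lists are assumptions copied from the output above ("root" under the 'xml' parser is assumed, since it was never printed), the helper matching_names is hypothetical, and using search() for the comparison is also an assumption.

import re

# Names as they appear in the printed output above (assumption: el.name
# keeps the "tc:" prefix with 'html.parser' but drops it with 'xml').
html_parser_names = ["tc:root", "tc:elem1", "tc:elem1"]
xml_parser_names = ["root", "elem1", "elem1"]  # "root" is assumed

patterns = ["tc:elem1", "tc:.*"]

def matching_names(pattern, names):
    """Return the names in which the compiled pattern finds a match."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

for pattern in patterns:
    print(pattern, "html.parser:", matching_names(pattern, html_parser_names))
    print(pattern, "xml:", matching_names(pattern, xml_parser_names))

If this reading is right, the regex results for both parsers follow from the names alone: every prefixed name matches, and no unprefixed name does.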

Please help.
