Bug when there are colons in Tag


k s

Feb 7, 2018, 10:41:19 PM
to beautifulsoup
I am trying to parse some XML files using bs4. It seems that the latest version, 4.6.0, on Python 2.7.11 fails to find tags with ":" in their names.

Here is the code I am running:

import bs4
from bs4 import BeautifulSoup

soup = r'<ns1:entry>Test</ns1:entry>'
soup = BeautifulSoup(soup, 'lxml')
print('bs4 version:', bs4.__version__)

for tag in [tag.name for tag in soup.find_all()]:
    print('Found tags', tag)

entry_tag = soup.find_all('ns1:entry')
print('Entry Tag:', entry_tag)


============

OUTPUT: Python 2.7.11, bs4 == 4.5.3

('bs4 version:', '4.5.3')
('Found tags', 'html')
('Found tags', 'body')
('Found tags', 'ns1:entry')
('Entry Tag:', [<ns1:entry>Test</ns1:entry>])


============

OUTPUT: Python 2.7.11, bs4 == 4.6.0

('bs4 version:', '4.6.0')
('Found tags', 'html')
('Found tags', 'body')
('Found tags', 'ns1:entry')
('Entry Tag:', [])


============

Note that in both versions, iterating over all tags still finds ns1:entry. But in the latest version, 4.6.0, calling find_all('ns1:entry') on the soup returns an empty list.


Is this a bug? I am happy to file it elsewhere if this isn't the right spot.


Thanks



Jim Tittsler

Feb 7, 2018, 11:58:49 PM
to beautifulsoup
1. I think you will want to parse an XML file using:
soup = BeautifulSoup(soup, 'xml')
which I think is synonymous with BeautifulSoup(soup, 'lxml-xml'),
since only lxml's XML parser is supported.

2. I believe you need to define the ns1 namespace to make lxml happy.
Try parsing:
xml = r'<ns1:entry xmlns:ns1="http://www.example.com/ns1/">Test</ns1:entry>'
Or make sure the namespace you are using is declared earlier, on an enclosing element:
xml = r'<xml xmlns:ns1="http://www.example.com/ns1/"><ns1:entry>Test</ns1:entry></xml>'
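
To illustrate point 2, here is a minimal sketch of why the namespace declaration matters to a namespace-aware XML parser. It uses the standard library's xml.etree.ElementTree as a stand-in, not lxml or Beautiful Soup, so treat it as an illustration only: lxml as driven by bs4 may recover from the undeclared prefix rather than raise. The helper try_parse is hypothetical, and the URI is the placeholder from the example above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI, taken from the example above.
NS1 = "http://www.example.com/ns1/"

undeclared = "<ns1:entry>Test</ns1:entry>"
declared = '<ns1:entry xmlns:ns1="%s">Test</ns1:entry>' % NS1


def try_parse(markup):
    """Return the root element's tag, or the parse error as a string."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


# The undeclared prefix is rejected by this parser.
undeclared_result = try_parse(undeclared)
# With the declaration, the prefix resolves to the namespace URI.
declared_result = try_parse(declared)

print(undeclared_result)
print(declared_result)
```

With the prefix declared, ElementTree reports the tag in its {uri}localname form, which shows that the parser resolved ns1 to the declared namespace instead of treating "ns1:entry" as an opaque name.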