XML parsing - end of string removed from tree if space in tag

Oriane Nedey

Apr 16, 2021, 12:34:00 PM
to beautifulsoup
Hello,

I'm having an issue when parsing strings as XML nodes, and I can't find any documentation about it on the web. Maybe some of you can help me?

I want to create a function that changes the name of the recognized and valid tags in a sentence.
The idea is that the script should handle both valid and invalid tags in the sentence. I came across 2 cases in which BeautifulSoup behaves unexpectedly, namely when there is whitespace right after the "<" of an opening or closing tag.
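
To make the goal concrete, here is a minimal sketch of such a renaming function that bypasses the XML parser entirely and works on the raw string with a regular expression. The function name `rename_valid_tags` and the target name `renamed` are hypothetical, chosen only for illustration, and the sketch assumes that "valid" means the tag name directly follows "<" or "</".

```python
import re

def rename_valid_tags(sentence, old, new):
    """Rename well-formed <old> and </old> tags; leave malformed ones as they are."""
    # "<", an optional "/", the exact tag name, optional attributes, then ">".
    # A tag with whitespace right after "<" (e.g. "< testtag>") does not match.
    pattern = re.compile(rf"<(/?){re.escape(old)}((?:\s[^<>]*)?)>")
    return pattern.sub(lambda m: f"<{m.group(1)}{new}{m.group(2)}>", sentence)

valid = "<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>"
bad_close = "<xml>This is a test sentence with <testtag>some tag< /testtag> and content after it.</xml>"

print(rename_valid_tags(valid, "testtag", "renamed"))
print(rename_valid_tags(bad_close, "testtag", "renamed"))
```

Note that this sketch treats every tag token independently: a valid opening tag is renamed even when its closing tag is malformed, and the malformed tag is left in the string exactly as written.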

1. Normal case:
```
from bs4 import BeautifulSoup

s = "<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
print(soup.find_all())
```
I get the following result: [<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>, <testtag>some tag</testtag>]

2. Whitespace at the beginning of the opening tag:
```
from bs4 import BeautifulSoup

s = "<xml>This is a test sentence with < testtag>some tag</testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
print(soup.find_all())
```
I get the following result: [<xml>This is a test sentence with </xml>]

3. Whitespace at the beginning of the closing tag:
```
from bs4 import BeautifulSoup

s = "<xml>This is a test sentence with <testtag>some tag< /testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
print(soup.find_all())
```
I get the following result: [<xml>This is a test sentence with <testtag>some tag</testtag></xml>, <testtag>some tag</testtag>]

--

I will try to find a way to protect those badly formed tags on my side, but I was wondering:
  1. When I parse the two badly formed sentences directly with the Python library lxml, I get an error (I call root = lxml.etree.fromstring(s) and get an error like "lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 36"). How is it that parsing via BeautifulSoup works, although the documentation says that the parsing is done with lxml?
  2. How is it possible that the badly formed closing tag is repaired, but the rest of the sentence simply disappears?
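
The lxml comparison from point 1 can be reproduced with the sketch below. It also tries lxml's `recover=True` parser option, on the assumption (not confirmed anywhere in this message) that BeautifulSoup's lxml-based XML builder enables a similar recovery mode, which could explain why no exception is raised there. The helper name `try_parse` is hypothetical, and the recovered tree may differ between lxml/libxml2 versions, so no expected output is shown.

```python
from lxml import etree

def try_parse(s, recover=False):
    """Parse s with lxml; return (root, None) on success or (None, error message)."""
    parser = etree.XMLParser(recover=recover)
    try:
        return etree.fromstring(s, parser), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)

well_formed = "<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>"
bad_open = "<xml>This is a test sentence with < testtag>some tag</testtag> and content after it.</xml>"
bad_close = "<xml>This is a test sentence with <testtag>some tag< /testtag> and content after it.</xml>"

for label, s in [("well-formed", well_formed), ("bad opening", bad_open), ("bad closing", bad_close)]:
    _, strict_error = try_parse(s)              # default strict parser
    root, _ = try_parse(s, recover=True)        # parser that tries to continue after errors
    recovered = etree.tostring(root, encoding="unicode") if root is not None else None
    print(label)
    print("  strict error:", strict_error)
    print("  recovered   :", recovered)
```

If the recovered output matches what BeautifulSoup returns for the same string, that would suggest the truncation comes from the underlying parser's error recovery rather than from BeautifulSoup itself.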

For information, I need the badly formed strings to remain badly formed (the function will be used for evaluation purposes), so I cannot simply repair the tags on my side.

Thank you very much for any guidance you can provide on this issue!

Best regards,
Oriane Nédey
