XML parsing - end of string removed from tree if space in tag

Oriane Nedey

Apr 16, 2021, 12:34:00 PM

to beautifulsoup

Hello,

I'm having an issue when parsing strings as XML nodes, and I can't find any documentation about it on the web. Maybe some of you can help me?

I want to create a function that changes the name of the recognized and valid tags in a sentence.

The idea is that the script should handle both valid and invalid tags in the sentence. I came across two cases in which BeautifulSoup behaves unexpectedly, namely when there is whitespace right after the "<" of an opening tag or of a closing tag.

1. Normal case:

```
from bs4 import BeautifulSoup

s = "<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
soup.find_all()
```

I get the following result: [<xml>This is a test sentence with <testtag>some tag</testtag> and content after it.</xml>, <testtag>some tag</testtag>]

2. Whitespace at the beginning of the opening tag:

```
s = "<xml>This is a test sentence with < testtag>some tag</testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
soup.find_all()
```

I get the following result: [<xml>This is a test sentence with </xml>]

3. Whitespace at the beginning of the closing tag:

```
s = "<xml>This is a test sentence with <testtag>some tag< /testtag> and content after it.</xml>"
soup = BeautifulSoup(s, "xml")
soup.find_all()
```

I get the following result: [<xml>This is a test sentence with <testtag>some tag</testtag></xml>, <testtag>some tag</testtag>]

I will try to find a way to protect those badly formed tags on my side, but I was wondering:

When I parse the two badly formed sentences with the Python library lxml directly, calling root = lxml.etree.fromstring(s), I get an error such as "lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 36". How is it that parsing via BeautifulSoup succeeds, although the documentation says that the parsing is done by lxml?
How is it possible that the badly formed closing tag is repaired, while the rest of the sentence simply disappears?
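For comparison, the contrast between strict parsing and lenient parsing can be reproduced with lxml alone. The sketch below is only illustrative: the helper names strict_error and recovered are hypothetical, and the idea that BeautifulSoup's "xml" mode relies on lxml's recover=True option is an assumption, not something confirmed in this thread. The recover option of lxml.etree.XMLParser itself is a documented lxml feature.

```python
from lxml import etree


def strict_error(xml_text):
    """Return lxml's strict-mode error message, or None if the text parses."""
    try:
        etree.fromstring(xml_text)
    except etree.XMLSyntaxError as exc:
        return str(exc)
    return None


def recovered(xml_text):
    """Parse with recover=True and return whatever lxml kept, serialized.

    In recovery mode lxml tries to continue past syntax errors instead of
    raising, so part of the input may be dropped without any warning.
    """
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(xml_text, parser=parser)
    if root is None:  # nothing could be salvaged
        return None
    return etree.tostring(root, encoding="unicode")


if __name__ == "__main__":
    s = ("<xml>This is a test sentence with < testtag>some tag</testtag> "
         "and content after it.</xml>")
    print("strict :", strict_error(s))
    print("recover:", recovered(s))
```

Running it on the two malformed sentences shows whether the truncated tree already appears at the lxml level when recovery is enabled, which would narrow down where the text is lost.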

For information, I need the badly formed strings to remain badly formed (the function will be used for evaluation purposes), so I cannot simply repair the tags on my side.
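To make the requirement concrete, here is a minimal sketch of a renaming function that never goes through an XML parser, so malformed tags stay exactly as they are. It is an assumption-laden illustration, not a BeautifulSoup feature: rename_valid_tags, TAG_RE and the mapping argument are hypothetical names, and "valid" is simplified to mean "a tag whose name directly follows < or </". It works on individual tag tokens, so it does not check that opening and closing tags are paired.

```python
import re

# A well-formed tag token: "<" or "</" immediately followed by a name,
# then optional attributes and an optional self-closing slash.
# "< testtag>" and "< /testtag>" do not match and are left untouched.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)(\s[^<>]*)?(/?)>")


def rename_valid_tags(sentence, mapping):
    """Rename well-formed tags according to mapping; leave the rest as-is."""

    def _swap(match):
        slash, name, rest, self_close = match.groups()
        new_name = mapping.get(name, name)  # unknown names are kept
        return f"<{slash}{new_name}{rest or ''}{self_close}>"

    return TAG_RE.sub(_swap, sentence)


if __name__ == "__main__":
    s = ("<xml>This is a test sentence with < testtag>some tag</testtag> "
         "and content after it.</xml>")
    print(rename_valid_tags(s, {"testtag": "renamed"}))
```

Because the text after the malformed tag is never handed to a parser, nothing is dropped from the sentence; the trade-off is that the regular expression knows nothing about nesting, comments or CDATA sections.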

Thank you very much for any guidance you can provide!

Best regards,

Oriane Nédey
