.append / .string Issues


August O'Beirne

Jan 4, 2022, 8:32:08 PM
to beautifulsoup
Hi everyone

I am trying to use Beautiful Soup 4.10 to edit the <title></title> tag in a document. I would like to be able to place Unicode special characters into that title. If Beautiful Soup is handed the characters directly, they are changed to �. If the string is modified beforehand, for example 'bö' to 'b&ouml;', Beautiful Soup replaces the ampersand with its own entity, producing 'b&amp;ouml;', which completely breaks the markup. Please advise on how to have Beautiful Soup append or modify text without performing any modifications on the string.

facelessuser

Jan 4, 2022, 9:19:49 PM
to beautifulsoup

When parsing a document, Beautiful Soup converts character entities into the actual Unicode characters, so the parsed text is the same whether a character was written directly or as an entity. This makes working with Unicode easier. Notice that the output in both cases below has no entities (this is covered in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters).

from bs4 import BeautifulSoup
print('----- Testing Input -----')
print(BeautifulSoup("<title>bö</title>", 'html.parser'))
print(BeautifulSoup("<title>b&ouml;</title>", 'html.parser'))
# ----- Testing Input -----
# <title>bö</title>
# <title>bö</title>

When appending text, just use the natural characters; Beautiful Soup will escape them on output only when absolutely necessary:

soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('bö')
print('----- Testing Output -----')
print(soup)
print('----- Testing Required Entities -----')
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('<')
print(soup)
# ----- Testing Output -----
# <title>bö</title>
# ----- Testing Required Entities -----
# <title>&lt;</title>
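
This also explains the 'b&amp;ouml;' result from the original question: an appended string is treated as literal text, so its ampersand is escaped on output. A minimal sketch, assuming the text you have is already entity-encoded, is to decode it first with the standard library's html.unescape (not part of Beautiful Soup) and then assign it with .string:

```python
import html
from bs4 import BeautifulSoup

# Appending a pre-escaped string: the "&" is literal text, so it is escaped on output.
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('b&ouml;')
double_escaped = str(soup)
print(double_escaped)

# Decode the entities first, then hand Beautiful Soup the real characters.
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.string = html.unescape('b&ouml;')
fixed = str(soup)
print(fixed)
# <title>b&amp;ouml;</title>
# <title>bö</title>
```

Note that .string replaces the tag's existing contents, while .append adds to them.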

If desired, you can force entities in the output by specifying the formatter as 'html' when encoding the output to a byte string:

soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('bö')
print('----- Testing Forced Entities -----')
print(soup.encode(formatter='html'))
# ----- Testing Forced Entities -----
# b'<title>b&ouml;</title>'
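
The same formatter can be passed to decode() if you want a str rather than bytes. The second half of this sketch shows one way a � can appear. It is only a guess at the cause in the original question, assuming the output bytes are being read back with a codec that does not match the one used to write them:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('bö')

# decode() returns a str; formatter='html' forces named entities as before.
as_text = soup.decode(formatter='html')
print(as_text)

# Encode to UTF-8 bytes, then decode with the matching codec: the character survives.
utf8_bytes = soup.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')
print(round_trip)

# Decode the same bytes with the wrong codec: each undecodable byte becomes U+FFFD.
garbled = utf8_bytes.decode('ascii', errors='replace')
print(garbled)
# <title>b&ouml;</title>
# <title>bö</title>
# <title>b��</title>
```

If this is what is happening, writing and reading the file with an explicit, matching encoding (for example UTF-8 on both sides) should keep the character intact.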