When parsing, Beautiful Soup converts character entities into the actual Unicode characters, so the resulting tree holds the same text whether a character was written as an entity or literally. This makes working with Unicode easier. Notice that the output in both cases has no entities (this is covered in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters).
from bs4 import BeautifulSoup
print('----- Testing Input -----')
print(BeautifulSoup("<title>b&ouml;</title>", 'html.parser'))
print(BeautifulSoup("<title>bö</title>", 'html.parser'))
# ----- Testing Input -----
# <title>bö</title>
# <title>bö</title>
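As a quick check of the same behaviour without relying on printed output, the sketch below compares the parsed strings directly. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Parse the same title once with an entity and once with the literal character.
from_entity = BeautifulSoup("<title>b&ouml;</title>", "html.parser")
from_literal = BeautifulSoup("<title>bö</title>", "html.parser")

# Both trees should hold the identical Unicode string.
same_text = from_entity.title.string == from_literal.title.string
print(same_text)
# True
```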
When appending text, just use the literal characters; Beautiful Soup escapes them on output only when that is required for valid markup:
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('bö')
print('----- Testing Output -----')
print(soup)
print('----- Testing Required Entities -----')
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('<')
print(soup)
# ----- Testing Output -----
# <title>bö</title>
# ----- Testing Required Entities -----
# <title>&lt;</title>
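To make the distinction between stored text and rendered markup concrete, here is a small sketch assuming the default output formatter. The text kept in the tree stays exactly as appended, and only the serialized output is escaped.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title></title>", "html.parser")
soup.title.append("a < b & c")

stored = soup.title.string   # the text as held in the tree
rendered = str(soup)         # the markup produced on output

print(stored)
print(rendered)
# a < b & c
# <title>a &lt; b &amp; c</title>
```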
If desired, you can force entities in the output by specifying the formatter as html, here while encoding the output to a byte string:
soup = BeautifulSoup('<title></title>', 'html.parser')
soup.title.append('bö')
print('----- Testing Forced Entities -----')
print(soup.encode(formatter='html'))
# ----- Testing Forced Entities -----
# b'<title>b&ouml;</title>'
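The formatter argument is not tied to byte output. As a sketch, decode() accepts the same formatter and returns a regular string, which may be more convenient if you do not want bytes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title></title>", "html.parser")
soup.title.append("bö")

default_text = soup.decode()              # default formatter leaves ö as-is
as_text = soup.decode(formatter="html")   # str with named entities
as_bytes = soup.encode(formatter="html")  # bytes with named entities

print(default_text)
print(as_text)
print(as_bytes)
# <title>bö</title>
# <title>b&ouml;</title>
# b'<title>b&ouml;</title>'
```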