Why doesn't BS add charset meta in header?


Heck Lennon

Apr 23, 2024, 9:43:50 AM
to beautifulsoup
Hello,

For some reason, BS doesn't add a utf-8 charset <meta> tag to the <head> when I call prettify().

Is there a setting I should use?

Thank you.

=============
import os
from bs4 import BeautifulSoup

ROOT = r".\input_test"

# all the files were previously converted from cp1252 to utf-8
files = [os.path.join(d, f) for d, _, names in os.walk(ROOT) for f in names if f.endswith(".html")]
for f in files:
    try:
        with open(f, encoding="utf-8") as fh:
            soup = BeautifulSoup(fh, "html.parser")
    except Exception as error:
        print(f, "\n", error)
    else:
        # BS doesn't add this: <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
        print(soup.prettify())
        break
=============

Heck Lennon

Apr 24, 2024, 4:29:53 AM
to beautifulsoup
It looks like BS updates the charset line if it's already there, but doesn't add one if it isn't:

before: <meta http-equiv="content-type" content="text/html; charset=Windows-1252">
after:  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
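
Something like this reproduces what I'm seeing (a minimal sketch):

=========
from bs4 import BeautifulSoup

with_meta = '<html><head><meta http-equiv="content-type" content="text/html; charset=Windows-1252"></head><body>x</body></html>'
without_meta = '<html><head></head><body>x</body></html>'

# The existing charset is rewritten to utf-8 in the output...
print(BeautifulSoup(with_meta, "html.parser").prettify())

# ...but no meta tag is added when there isn't one to begin with.
print(BeautifulSoup(without_meta, "html.parser").prettify())
=========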

Heck Lennon

Apr 24, 2024, 6:02:49 AM
to beautifulsoup
Although the file was previously converted from cp1252 to utf-8 by BS, BS is still unable to tell the original encoding when it reads the converted file back:

=========
from bs4 import BeautifulSoup

# INPUTFILE is one of the previously converted files
soup = BeautifulSoup(open(INPUTFILE), "html.parser")
# prints "None"!
print(soup.original_encoding)

with open(INPUTFILE, encoding="UTF-8") as ff:
    soup = BeautifulSoup(ff, "html.parser")
    # prints "None"!
    print(soup.original_encoding)
=========

leonardr

Apr 24, 2024, 9:44:40 AM
to beautifulsoup
I think I can clear this up.

First, as you discovered, Beautiful Soup won't add a <meta> tag to a document that doesn't already have one. If a document does have a charset <meta> tag, Beautiful Soup will modify the charset attribute during encoding to be consistent with the output encoding, but that's as far as it will go.
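
If you do want a charset <meta> tag in documents that don't have one, you can insert it yourself before writing the output. A minimal sketch (this uses the short <meta charset=...> form; pick whichever form you prefer):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head><body>hi</body></html>", "html.parser")

# Add a charset declaration by hand if the <head> doesn't already carry one.
# (This only checks for the <meta charset=...> form, not the http-equiv form.)
if soup.head is not None and soup.head.find("meta", charset=True) is None:
    soup.head.insert(0, soup.new_tag("meta", charset="utf-8"))

print(soup.prettify())
# the <head> now starts with <meta charset="utf-8"/>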

Regarding .original_encoding: this attribute will be None if the document was Unicode by the time Beautiful Soup saw it, since Unicode itself doesn't have an encoding. I think that's what's happening in your case.

In your examples, you're reading from a file and converting the bytestream to Unicode before handing it to Beautiful Soup. In your first example you're doing it implicitly: by default, Python's open() function decodes incoming data using the system locale. In the second example you're explicitly setting the encoding to UTF-8. In both cases, the data coming in to Beautiful Soup is already Unicode, so as far as it knows there is no original encoding.

If you open up the file in binary mode, Beautiful Soup will do the conversion (as opposed to Python doing it) and you'll get a value for .original_encoding.
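
For example (a small variation on your snippet; INPUTFILE is the same path as above):

from bs4 import BeautifulSoup

# Pass raw bytes and let Beautiful Soup figure out the encoding itself.
with open(INPUTFILE, "rb") as ff:
    soup = BeautifulSoup(ff, "html.parser")

print(soup.original_encoding)
# now reports the detected encoding (e.g. "utf-8") instead of None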

If you're trying to find the (potentially inaccurate) original value for the "charset" attribute of a <meta> tag, it's still there, so long as you don't encode the tag to some other encoding:

from bs4 import BeautifulSoup

data = '<html><head><meta charset="some random encoding"></head></html>'
soup = BeautifulSoup(data, 'html.parser')

print(soup)
# <html><head><meta charset="utf-8"/></head></html>

print(soup.meta['charset'])
# some random encoding

print(soup.meta['charset'].original_value)
# some random encoding

The value has become a CharsetMetaAttributeValue object (ContentMetaAttributeValue is the equivalent for the http-equiv form), a chameleon object which has its original value except during encoding.
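
Encoding the document is when the substitution actually kicks in; a quick sketch continuing the snippet above:

# The charset attribute is rewritten to match whatever encoding you ask for.
print(soup.encode("latin-1"))
# b'<html><head><meta charset="latin-1"/></head></html>'

# The original value is still available afterwards.
print(soup.meta['charset'].original_value)
# some random encoding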

Leonard

Heck Lennon

Apr 24, 2024, 10:18:47 AM
to beautifulsoup
Thanks!