Data in head doesn't match what it says in the file


Heck Lennon

Feb 28, 2024, 10:00:06 AM
to beautifulsoup
Hello,

I need to convert a bunch of HTML files from cp-1252 to utf-8.

Besides converting the actual contents, I need to add/edit the meta tag in the header.

For some reason, Soup says the original file is "utf-8" although it is cp-1252.

Why is that?

Thank you.

"""
input.html:
<head>
<meta http-equiv="content-type" content="text/html; charset=Windows-1252">
<title>blah</title>
</head>
etc.
"""

from bs4 import BeautifulSoup

file = r"c:\temp\input.html"
# Read raw bytes so Beautiful Soup can detect the encoding itself;
# text mode would decode with the platform default before parsing.
with open(file, 'rb') as f:
  content_raw = f.read()

soup = BeautifulSoup(content_raw, 'html.parser')
# Encoding Beautiful Soup detected for the input bytes
print("Original encoding:", soup.original_encoding)

#print original head
#!!!! Original: <meta http-equiv="content-type" content="text/html; charset=Windows-1252">
#BS says: "<meta content="text/html; charset=utf-8" http-equiv="content-type"/>"
print(soup.head)
print("========")

meta = soup.find("meta",attrs={'http-equiv':'content-type','content':'text/html; charset=Windows-1252'})
if meta:
  print("Found:",meta.has_attr('content'))
  #edit
  meta['content'] = "text/html; charset=utf-8"
else:
  print("Not found")
  #insert new meta in head
  # attributes must be passed via attrs=; the second positional argument is the namespace
  new_meta = soup.new_tag("meta", attrs={'http-equiv':'content-type','content':'text/html; charset=utf-8'})
  soup.head.append(new_meta)
print(soup.head)

leonardr

Feb 28, 2024, 11:29:57 AM
to beautifulsoup
The short answer is that Beautiful Soup is transparently converting your HTML files to UTF-8. You don't need to manually edit the meta tag.

As part of parsing a file, Beautiful Soup converts the data from its native encoding to Unicode. At any point after that, outputting the markup--whether you are using print() or methods like encode(), decode(), or prettify()--requires specifying an output encoding. The default output encoding is UTF-8.

When the Tag object for a <meta> tag is encoded back into a string, the value of the charset parameter is set to the output encoding--not the original document encoding, which would no longer be accurate once the document has been re-encoded.

That's why print(soup.head) gives you content="text/html; charset=utf-8". It's the same as calling print(soup.head.decode()), which is the same as calling print(soup.head.decode(eventual_encoding="utf-8")). If you call print(soup.head.decode(eventual_encoding="euc-jp")) you'll get content="text/html; charset=euc-jp", and so on.
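A minimal sketch of that behaviour, assuming bs4 is installed. The sample markup is hypothetical, modelled on the input.html from the question, and the output encoding is passed through the eventual_encoding keyword of decode():

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the input.html from the question.
html = (
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html; charset=Windows-1252">'
    '<title>blah</title>'
    '</head><body></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# The attribute value stored in the tree still holds the source declaration.
stored = str(soup.meta["content"])

# Serializing rewrites the charset to match the requested output encoding.
as_utf8 = soup.head.decode()                             # default output encoding
as_eucjp = soup.head.decode(eventual_encoding="euc-jp")  # explicit output encoding

print(stored)
print(as_utf8)
print(as_eucjp)
```

The tree itself is not modified by either call; only the serialized string differs.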

To see the original value of the "content" attribute of the <meta> tag, you can check soup.meta['content']. It's still in there, but that particular string will only be output if you ask for cp-1252 as the output encoding.
To see the original encoding of the document itself, you can check soup.original_encoding.
For more information, see these sections of the Beautiful Soup documentation: Encodings and Output Encoding.
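A short sketch of both checks, assuming bs4 is installed. The cp-1252 document is built in memory here purely for illustration; a real script would read the file in binary mode instead. Note that original_encoding is only populated when Beautiful Soup is given bytes, since a str has already been decoded by the time it is parsed:

```python
from bs4 import BeautifulSoup

# Hypothetical cp-1252 document built in memory; a real script would use
# open(path, "rb").read() to get the raw bytes.
raw = (
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html; charset=Windows-1252">'
    '<title>blah</title>'
    '</head><body><p>caf\u00e9 \u20ac5</p></body></html>'
).encode("cp1252")

soup = BeautifulSoup(raw, "html.parser")
print(soup.original_encoding)   # encoding detected for the input bytes
print(soup.meta["content"])     # attribute value as written in the source

# Parsing an already-decoded str leaves nothing to detect.
from_text = BeautifulSoup(raw.decode("cp1252"), "html.parser")
print(from_text.original_encoding)   # None
```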

Leonard
 

Heck Lennon

Feb 28, 2024, 12:30:29 PM
to beautifulsoup
Thanks for the info.

So it means that BS can take care of 1) converting the file's contents from e.g. cp-1252 to utf-8, and 2) adding/updating the meta tag by itself, so I only need to loop through *.html and let BS perform its magic?

leonardr

Feb 28, 2024, 12:48:43 PM
to beautifulsoup
Yes, that's right. Run the file through Beautiful Soup and write it back out, and you'll get (approximately) the same document in UTF-8. However, if preserving the exact markup is important to you, it'd be better not to parse the document at all: make the character conversion with Unicode, Dammit and then use a regular expression to change the <meta> tag.
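The two approaches could be sketched roughly as follows, assuming bs4 is installed. The function names and the META_CHARSET pattern are hypothetical, written for this illustration only; the pattern rewrites the first charset value found inside a <meta> tag and is not a general HTML-aware solution:

```python
import re
from bs4 import BeautifulSoup, UnicodeDammit

# Assumed pattern for illustration: captures everything up to the charset
# value inside a <meta ...> tag, with or without a quote before the value.
META_CHARSET = re.compile(
    r'(<meta\b[^>]*?\bcharset\s*=\s*["\']?)[\w.:-]+',
    re.IGNORECASE,
)

def roundtrip_to_utf8(raw: bytes) -> bytes:
    """Parse with Beautiful Soup and serialize as UTF-8 (markup may change)."""
    soup = BeautifulSoup(raw, "html.parser")
    return soup.encode("utf-8")

def to_utf8(raw: bytes) -> bytes:
    """Convert to UTF-8 without parsing, so the markup is otherwise preserved."""
    dammit = UnicodeDammit(raw, is_html=True)
    if dammit.unicode_markup is None:
        raise ValueError("could not determine the source encoding")
    # Rewrite only the first declared charset; everything else is untouched.
    text = META_CHARSET.sub(
        lambda m: m.group(1) + "utf-8", dammit.unicode_markup, count=1
    )
    return text.encode("utf-8")

# Hypothetical cp-1252 input for demonstration.
sample = (
    '<head><meta http-equiv="content-type" '
    'content="text/html; charset=Windows-1252"><title>blah</title></head>'
    '<p>caf\u00e9</p>'
).encode("cp1252")

converted = to_utf8(sample)
reparsed = roundtrip_to_utf8(sample)
print(converted.decode("utf-8"))
print(reparsed.decode("utf-8"))
```

To process a directory, either function can be applied to the bytes of each *.html file and the result written back in binary mode. Keeping backups of the originals is advisable until the output has been checked.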

Leonard

Heck Lennon

Feb 28, 2024, 7:10:13 PM
to beautifulsoup
Fantastic! Thank you.