Right way to remove duplicates in head?

Heck Lennon

Apr 24, 2024, 9:27:34 AM

to beautifulsoup

Hello,

Each time I run the script below, it adds another meta element to the head of the document.

What is the right way to remove duplicates?

find_all() returns a ResultSet, so the code below doesn't work.

Thank you.

===========

from bs4 import BeautifulSoup

# "files" is assumed to be a list of HTML file paths built earlier in the script
for f in files:
    print(f)
    try:
        with open(f, encoding="utf-8") as fh:
            soup = BeautifulSoup(fh, "html.parser")
    except Exception as error:
        print(f, "\n", error)
        exit()
    else:
        #<meta http-equiv="content-type" content="text/html; charset=utf-8">
        meta = soup.find("meta", {"http-equiv":"content-type"})
        if not meta:
            metatag = soup.new_tag('meta')
            metatag.attrs['http-equiv'] = 'Content-Type'
            metatag.attrs['content'] = 'text/html; charset=utf-8'
            soup.head.append(metatag)
        else:
            #If script run more than once, BS adds new element...
            #skip first element, remove others
            #TODO doesn't work
            #find_all() returns a ResultSet
            metas = soup.find_all("meta", {"http-equiv":"content-type"})
            #wrong number of elements in list
            print("Found: ", len(metas))
            #AttributeError: 'NoneType' object has no attribute 'count'
            #print(soup.ResultSet.count(metas))
            #AttributeError: type object 'BeautifulSoup' has no attribute 'ResultSet'
            #print(BeautifulSoup.ResultSet.count(metas))
            for meta in metas[1:]:
                meta.decompose()

===========
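On the ResultSet concern: as far as I know, the ResultSet returned by find_all() is a subclass of Python's list, so len() and slicing work on it directly and no separate count helper is needed. A minimal sketch, using a made-up HTML string in place of the original files (the sample markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with three identical meta elements
html = (
    "<html><head>"
    + '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' * 3
    + "</head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

metas = soup.find_all("meta", {"http-equiv": "Content-Type"})
is_list = isinstance(metas, list)   # ResultSet subclasses list
found = len(metas)                  # plain len() gives the number of matches

# Keep the first match and remove the rest from the tree
for extra in metas[1:]:
    extra.decompose()

remaining = len(soup.find_all("meta"))
print(is_list, found, remaining)
```

If the count looks wrong in a real file, the filter itself is the more likely culprit than the ResultSet.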

Heck Lennon

Apr 24, 2024, 10:17:30 AM

to beautifulsoup

Found it: while the documentation says parsers convert tag and attribute names to lowercase, they leave attribute values untouched.

I missed that the input files used "Content-Type", so find_all() failed when looking for "content-type":

======

print(soup.head.prettify())
meta = soup.head.find("meta", {"http-equiv":"Content-Type"})

if not meta:
    metatag = soup.new_tag('meta')
    metatag.attrs['http-equiv'] = 'Content-Type'
    metatag.attrs['content'] = 'text/html; charset=utf-8'
    soup.head.append(metatag)
else:
    #If script run more than once, BS adds line...
    #! values must match upper/lowercase
    metas = soup.find_all("meta", {"http-equiv":"Content-Type"})
    print("Found: ", len(metas))

    #skip first element, remove others
    for meta in metas[1:]:
        meta.decompose()

print(soup.head.prettify())

======
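The case behaviour described in this post can be checked with a short experiment. This is only a sketch with a made-up input string (the markup and variable names are assumptions), showing that html.parser lower-cases tag and attribute names but keeps the attribute value as written, so an exact string filter has to match the value's case:

```python
from bs4 import BeautifulSoup

# Hypothetical input with upper-case tag and attribute names
html = '<HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"></HEAD>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("meta")
tag_name = tag.name               # tag name is lower-cased by the parser
attr_names = sorted(tag.attrs)    # attribute names are lower-cased too
attr_value = tag["http-equiv"]    # attribute value keeps its original case

# An exact string filter is case-sensitive, so the lower-case search misses
miss = soup.find("meta", {"http-equiv": "content-type"})
hit = soup.find("meta", {"http-equiv": "Content-Type"})
print(tag_name, attr_names, attr_value, miss is None, hit is not None)
```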

Chris Papademetrious

Apr 28, 2024, 8:34:43 AM

to beautifulsoup

I'm glad you figured it out!

In the future, if you need to perform case-insensitive string matches (e.g. your data contains a mix of conventions), you can match against a regular expression:

soup.find_all("meta", {"http-equiv": re.compile(...)})

where the regular expression is something like:

re.compile("your-text-here", flags=re.IGNORECASE)

or more concisely:

re.compile("(?i:your-text-here)")

I prefer the expanded form, as I think it's easier for others to understand the code.

- Chris
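Putting the thread together, one possible way to make the script safe to run repeatedly is sketched below. It is not code from either poster: the helper name ensure_single_content_type and the sample markup are hypothetical. The pattern is anchored with ^ and $ because Beautiful Soup applies a compiled regular expression with its search() method, which would otherwise also match longer values that merely contain the text.

```python
import re
from bs4 import BeautifulSoup

# Anchored so that only the exact text matches, in any letter case
CONTENT_TYPE = re.compile(r"^content-type$", flags=re.IGNORECASE)

def ensure_single_content_type(soup):
    """Hypothetical helper: leave exactly one Content-Type meta in <head>.

    Returns the number of duplicate elements that were removed.
    """
    metas = soup.head.find_all("meta", {"http-equiv": CONTENT_TYPE})
    if not metas:
        # Nothing there yet: add the element once
        metatag = soup.new_tag("meta")
        metatag.attrs["http-equiv"] = "Content-Type"
        metatag.attrs["content"] = "text/html; charset=utf-8"
        soup.head.append(metatag)
        return 0
    # Keep the first match, remove the rest
    for extra in metas[1:]:
        extra.decompose()
    return len(metas) - 1

# Made-up input mixing three spellings of the same attribute value
html = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
    '<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8">'
    "</head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
removed_first = ensure_single_content_type(soup)
removed_second = ensure_single_content_type(soup)  # second run changes nothing
left = len(soup.head.find_all("meta"))

empty = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
ensure_single_content_type(empty)
ensure_single_content_type(empty)
added = len(empty.head.find_all("meta"))
print(removed_first, removed_second, left, added)
```

Because the same case-insensitive filter is used for both the existence check and the clean-up, running the helper a second time neither adds nor removes anything.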
