Right way to remove duplicates in head?


Heck Lennon

Apr 24, 2024, 9:27:34 AM
to beautifulsoup
Hello,

Running the script adds an element in the header each time it's run.

What is the right way to remove duplicates?

find_all() returns a ResultSet, so the code below doesn't work.

Thank you.

===========
for f in files:
  print(f)
  try:
    soup = BeautifulSoup(open(f, encoding="utf-8"), "html.parser")
  except Exception as error:
    print(f,"\n",error)
    exit()
  else:
    #<meta http-equiv="content-type" content="text/html; charset=utf-8">
    meta = soup.find("meta",  {"http-equiv":"content-type"})
    if not meta:
      metatag = soup.new_tag('meta')
      metatag.attrs['http-equiv'] = 'Content-Type'
      metatag.attrs['content'] = 'text/html; charset=utf-8'
      soup.head.append(metatag)
    else:
      #If script run more than once, BS adds new element...
      #skip first element, remove others
      #TODO doesn't work
      #find_all() returns a ResultSet
      metas = soup.find_all("meta",  {"http-equiv":"content-type"})
      #wrong number of elements in list
      print("Found: ",len(metas))
      #AttributeError: 'NoneType' object has no attribute 'count'
      #print(soup.ResultSet.count(metas))
      #AttributeError: type object 'BeautifulSoup' has no attribute 'ResultSet'
      print(BeautifulSoup.ResultSet.count(metas))
      for meta in metas[1:]:
        meta.decompose()
===========

Heck Lennon

Apr 24, 2024, 10:17:30 AM
to beautifulsoup
Found it: while the docs say that parsers convert tag and attribute names to lower case, they don't touch attribute values.

I missed that the input files used "Content-Type", so find() and find_all() matched nothing when looking for "content-type":

======
print(soup.head.prettify())
meta = soup.head.find("meta",  {"http-equiv":"Content-Type"})

if not meta:
  metatag = soup.new_tag('meta')
  metatag.attrs['http-equiv'] = 'Content-Type'
  metatag.attrs['content'] = 'text/html; charset=utf-8'
  soup.head.append(metatag)
else:
  #If script run more than once, BS adds line...

  #! values must match upper/lowercase
  metas = soup.find_all("meta",  {"http-equiv":"Content-Type"})
  print("Found: ",len(metas))


  #skip first element, remove others
  for meta in metas[1:]:
    meta.decompose()
print(soup.head.prettify())
======
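
For anyone who wants to check the behaviour described above, here is a minimal sketch. It assumes the "html.parser" backend used earlier in the thread; the upper-case sample markup is made up for illustration and is not taken from the original input files.

```python
from bs4 import BeautifulSoup

# Made-up sample: tag and attribute names are deliberately upper-case.
html = (
    '<HTML><HEAD>'
    '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">'
    '</HEAD></HTML>'
)
soup = BeautifulSoup(html, "html.parser")

meta = soup.find("meta")        # tag name was lower-cased by the parser
attr_names = list(meta.attrs)   # attribute names were lower-cased too
value = meta["http-equiv"]      # the attribute value keeps its original case

# A plain string filter is compared exactly, including case.
exact_miss = soup.find("meta", {"http-equiv": "content-type"})
exact_hit = soup.find("meta", {"http-equiv": "Content-Type"})

print(attr_names, value, exact_miss is None, exact_hit is meta)
```

So the lower-case search string only fails because of the value, not because of the attribute name.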

Chris Papademetrious

Apr 28, 2024, 8:34:43 AM
to beautifulsoup
I'm glad you figured it out!

In the future, if you need to perform case-insensitive string matches (e.g. your data contains a mix of conventions), you can match against a regular expression:

soup.find_all("meta", {"http-equiv": re.compile(...)})

where the regular expression is something like:

re.compile("your-text-here", flags=re.IGNORECASE)

or more concisely:

re.compile("(?i:your-text-here)")

I prefer the expanded form, as I think it's easier for others to understand the code.
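
Putting that together with the de-duplication loop from earlier in the thread, a rough sketch might look like the one below. The sample HTML and the variable names are invented for illustration. One assumption worth flagging: Beautiful Soup applies a compiled pattern with a search rather than a full match, so the sketch anchors the pattern with ^ and $ to match the whole attribute value.

```python
import re
from bs4 import BeautifulSoup

# Invented sample with the same header written in three different cases.
html = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta HTTP-EQUIV="CONTENT-TYPE" content="text/html; charset=utf-8">
<title>Example</title>
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Anchored so that only the whole value matches, ignoring case.
pattern = re.compile(r"^content-type$", flags=re.IGNORECASE)

metas = soup.find_all("meta", {"http-equiv": pattern})
found = len(metas)

# Keep the first match, remove the rest from the tree.
for extra in metas[1:]:
    extra.decompose()

remaining = soup.find_all("meta", {"http-equiv": pattern})
print("Found:", found, "Remaining:", len(remaining))
```

find_all() returns a ResultSet, which behaves like a list here, so len() and slicing with [1:] work directly on it.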

 - Chris
