Beautiful Soup 4.13.0

leonardr

Feb 2, 2025, 2:38:45 PM
to beautifulsoup

After a beta period lasting nearly a year, I've released the biggest update to Beautiful Soup in many years. For version 4.13.0 I added type hints to the Python code, and in doing so uncovered a large number of very small inconsistencies in the code. I've fixed the inconsistencies, but the result is a larger-than-usual number of deprecations and changes that may break backwards compatibility.

The CHANGELOG for 4.13.0 is quite large so this announcement highlights just the most important changes, specifically the changes most likely to make you need (or want) to change your code.

Deprecations and backwards-incompatible changes

  • DeprecationWarning is issued on use for every deprecated method, attribute and class from the 3.0 and 2.0 major versions of Beautiful Soup. These have been deprecated for at least ten years, but they didn't issue DeprecationWarning when you tried to use them. Now they do, and they're all going away soon.
  • This version drops support for Python 3.6, which went EOL in December 2021. The minimum supported major Python version for Beautiful Soup is now Python 3.7, which went EOL in June 2023.
  • The storage for a tag's attribute values now modifies incoming values to be consistent with the HTML or XML spec. This means that if you set an attribute value to a number, it will be converted to a string immediately, rather than being converted when you output the document.

    More importantly for backwards compatibility, setting an HTML attribute value to True will set the attribute's value to the appropriate string per the HTML spec. Setting an attribute value to False or None will remove the attribute value from the tag altogether, rather than (effectively, as before) setting the value to the string "False" or the string "None".

    This means that some programs that modify documents will generate different output than they would in earlier versions of Beautiful Soup, but the new documents are more likely to represent the intent behind the modifications.

  • If you pass an empty list as the attribute value when searching the tree, you will now find all tags which have that attribute set to a value in the empty list--that is, you will find nothing. This is consistent with other situations where a list of acceptable values is provided. Previously, an empty list was treated the same as None and False, and you would have found the tags which did not have that attribute set at all.
  • When using one of the find() methods or creating a SoupStrainer, if you specify the same attribute in both attrs and the keyword arguments, both values are kept, so you end up with two different ways to match that attribute. Previously the value in the keyword arguments would override the value in attrs.
  • The 'html5' formatter is now much less aggressive about escaping ampersands, escaping only the ampersands considered "ambiguous" by the HTML5 spec (which is almost none of them). This is the sort of change that might break your unit test suite, but the resulting markup will be much more readable and more HTML5-ish.

    In the future, the 'html5' formatter may become the default HTML formatter, which will change Beautiful Soup's default output. This will break a lot of test suites so it's not going to happen for a while.

New features

  • The new ElementFilter class encapsulates Beautiful Soup's rules about matching elements and deciding which parts of a document to parse. This gives you direct access to Beautiful Soup's low-level matching API. See the documentation for details.
  • The new PageElement.filter() method provides a fully general way of finding elements in a Beautiful Soup parse tree. You can specify a function to iterate over the tree and an ElementFilter to determine what matches.
  • The NavigableString class now has a .string property which returns the string itself. This makes it easier to iterate over a mixed list of Tag and NavigableString objects.
  • Defined a new warning class, UnusualUsageWarning, which is a superclass for all of the warnings issued when Beautiful Soup notices something unusual but not guaranteed to be wrong, like markup that looks like a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML parser (XMLParsedAsHTMLWarning).

    The text of these warnings has been revamped to explain in more detail what is going on, how to check if you've made a mistake, and how to make the warning go away if you are acting deliberately.

    If these warnings are interfering with your workflow, or simply annoying you, you can filter all of them by filtering UnusualUsageWarning, without worrying about losing the warnings Beautiful Soup issues when there *definitely* is a problem you need to correct, such as use of a deprecated method.

  • Beautiful Soup now emits an UnusualUsageWarning if you try to search for an attribute called _class; you probably mean class_.